Abstract

We give an overview of several aspects arising in the statistical analysis of extreme risks with
actuarial applications in view. In particular it is demonstrated that empirical process theory is a
very powerful tool, both for the asymptotic analysis of extreme value estimators and to devise
tools for the validation of the underlying model assumptions. While the focus of the paper
is on univariate tail risk analysis, the basic ideas of the analysis of the extremal dependence
between different risks are also outlined. Here we emphasize some of the limitations of classical
multivariate extreme value theory and sketch how a different model proposed by Ledford and
Tawn can help to avoid pitfalls. Finally, these theoretical results are used to analyze a data set
of large claim sizes from health insurance.

1 Introduction 

In nonlife insurance, extreme events usually constitute a considerable portion of the total risk
covered by an insurance company. Therefore, in actuarial practice extreme value statistics (though
often in a simplified form) has been used for at least two decades to assess the risk of large
claims. Given their exposure to huge claims, it is natural that reinsurers were among the first
to emphasize the need for appropriate models of losses exceeding high thresholds. While the use
of Pareto distributions and generalizations thereof was advocated early (see, e.g., Schmutz and
Doerr (1998)), the fact that they naturally arise as approximative models for exceedances was not
always fully acknowledged; instead, they were often considered yet another useful parametric model.
This situation has thoroughly changed. Nowadays it is rarely called into question that the
assessment of "tail risks" requires specific methods and that extreme value theory often (though not
always) offers efficient and mathematically sound procedures to deal with such problems. Moreover,
several accessible introductions both to general extreme value statistics and to its application to
actuarial problems have been published; see, e.g., Embrechts et al. (1997), Beirlant et al. (2004),
McNeil (1997) and Cebrian et al. (2003). For that reason, the present paper focuses on specific
aspects which have perhaps not attracted the attention they deserve:

• We will show that empirical process theory offers a general framework to deal with different 
steps in the risk analysis from model fitting to model validation and the estimation of risk 
measures. 
• An important step in a prudent risk assessment is to validate the model assumptions on
which the statistical analysis is based. To this end, graphical tools like qq-plots are widely
used, but judging which deviations from the ideal line indicate a violation of the model
assumptions is largely subjective, and experience from classical statistical applications can
be misleading if one analyzes heavy-tailed data. Hence Section 4 describes how to
refine such tools to obtain a rigorous statistical test.

• The analysis of the dependence between different extreme risks has been extensively discussed 
in the recent statistical literature. We will first comment on problems arising when parametric 
copula models are used to this end. Then we discuss how to overcome a serious weakness of 
classical multivariate extreme value statistics. 

As the choice of topics addressed by such a partial survey is subjective, it is inevitable that some
readers will miss aspects they consider particularly important. Perhaps the most obvious topic we
only touch on concerns the extreme value analysis of investment risks. Although recent years have
shown that in some instances the asset side of the balance sheet contains the most serious risks of
extreme losses, for several reasons we will nevertheless focus here on "genuine" actuarial problems
related to the insured risks. Firstly, the statistical and economic literature on extreme investment
risks abounds. Secondly, though the very basics of extreme value theory needed in this context
are the same as those discussed here, a serious treatment of market risks would require a lengthy
introduction to the extreme value behavior of time series; we refrain from discussing this topic
in detail in order not to overload the article. Finally, we feel that mathematically satisfactory
solutions are yet to be developed for important practical problems like the risk assessment for
complex portfolios. An exposition that cannot go into great detail carries the risk of provoking
a serious misconception of the solutions that the state of the art in statistical theory can actually
deliver.

As there are plenty of other important problems we can merely touch on, we will try to mitigate
this lack by giving references where aspects important in actuarial applications are discussed in
greater detail. We do not, however, aim to give a full overview of the rapidly expanding
literature relevant in this context. Hence the present text may be best characterized as a tutorial
with particular emphasis laid on crucial points which, from my personal point of view, have often
not attracted the interest they deserve.

The paper is organized as follows. Section 2 gives an introduction to the basics of univariate
extreme value theory, with particular emphasis on conditional distributions of exceedances (instead
of the distribution of maxima as in the classical approach). In Section 3 we discuss how to construct
extreme value estimators of quantities like risk measures and insurance premiums which depend
only on the tail behavior. Then Section 4 deals with methods to define a tail region depending on
the data and the purpose and methods of the tail analysis as well as tools for model validation.
In both these sections, a limit theorem for the tail empirical quantile function proves extremely
useful. In Section 5 we outline how the dependence structure between the components of a vector
of risks can be statistically analyzed. In Section 6 the previously introduced statistical procedures
are used to analyze a data set of large claims in US health insurance. All proofs are deferred to
the final Section 7.
2 Basics of univariate extreme value theory

Classically a synopsis on extreme value theory starts with the analysis of maxima of independent
and identically distributed random variables (iid rv's). We prefer to discuss the asymptotic behavior
of excesses over high thresholds, because these naturally arise as effective claim sizes in
insurances with high retention levels, while the maximum of claim sizes rarely is an economically
meaningful quantity.

In what follows, let $X$ denote a rv defined on some probability space $(\Omega, \mathcal{A}, P)$ with cumulative
distribution function (cdf) $F$ and quantile function $F^{\leftarrow}$ (i.e., the generalized inverse of $F$). If $X$
describes a loss covered by an insurance with retention level $u$, then

\[ F_u(x) := P(X - u \le x \mid X > u) \]

is the cdf of the actual claim size. For a very high retention level $u$ (e.g., in an excess of loss
reinsurance against catastrophes), usually few or none of the losses observed so far exceed $u$, so
that standard methods for risk modeling and premium calculation do not apply directly. Of course,
one could assume a parametric model for all losses, estimate its parameters (provided the full losses
are observed) and calculate the resulting conditional cdf of the excesses over $u$. Then, however,
the fitted model for $F_u$ is largely determined by the bulk of losses that are much smaller than the
losses of interest that exceed $u$. Hence such an approach seems advisable only if one is confident
that the same "stochastic mechanism" generates the moderate losses on the one hand and large
losses on the other hand, and that all these losses can be well described by the chosen parametric
model. As this will rarely be justified, it is widely accepted that for modeling $F_u$ one should consult
only losses which are large, though perhaps still smaller than $u$. Sometimes this general idea is
summed up in the catchy phrase "Let the tails speak for themselves".

The basic idea of extreme value theory is to tackle this problem by assuming that, after a
suitable normalization, the cdf $F_u$ converges to a non-degenerate limit as the threshold $u$ tends to
the largest possible loss $F^{\leftarrow}(1)$. More precisely, we assume that for some (measurable) function
$a > 0$ there exists a non-degenerate cdf $H$ (i.e., $H(\mathbb{R}) \not\subset \{0, 1\}$) such that

\[ F_u(a(u)x) = P\Bigl(\frac{X - u}{a(u)} \le x \Bigm| X > u\Bigr) \longrightarrow H(x) \tag{2.1} \]

as $u \uparrow F^{\leftarrow}(1)$ for all points of continuity $x$ of $H$ (i.e., $F_u(a(u)\,\cdot) \to H$ weakly). It turns out that
then $H$ is necessarily the cdf of a generalized Pareto distribution (GPD), that is

\[ H(x) = H_{\gamma,\sigma}(x) := \begin{cases} 0 & \text{if } x < 0, \\ 1 - (1 + \gamma x/\sigma)^{-1/\gamma} & \text{if } x \ge 0,\ 1 + \gamma x/\sigma > 0, \\ 1 & \text{if } 1 + \gamma x/\sigma \le 0,\ \gamma < 0. \end{cases} \]

Here $H_{0,\sigma}(x)$ is interpreted as $\lim_{\gamma \to 0} H_{\gamma,\sigma}(x) = (1 - e^{-x/\sigma}) 1_{[0,\infty)}(x)$, which is an exponential cdf.
Note that the scale parameter depends on the choice of the normalizing function $a$; we can and
will always assume $\sigma = 1$ and write $H_\gamma$ instead of $H_{\gamma,1}$.

If (2.1) holds, then the conditional cdf $F_u$ can be approximated by $H_{\gamma,\sigma_u}$ with $\sigma_u = a(u)$,
provided $u$ is sufficiently large. In that case, the following approximation of the tail of the loss cdf
$F$ follows:

\[ 1 - F(y) = (1 - F(u))(1 - F_u(y - u)) \approx (1 - F(u))(1 - H_{\gamma,\sigma_u}(y - u)) \tag{2.2} \]

for $y > u$. Of course, this approximation can also be used for thresholds $u$ different from the
retention level at hand. It is important to note that (almost) always (2.2) is only an approximation
to the tail and that its accuracy depends on the choice of $u$. Hence one should avoid considering
the GPD model to be the "true" one above a certain threshold $u$. As we will discuss in detail later
on, there will always be a bias-variance trade-off when choosing a threshold to estimate premiums
or risk measures.
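To make approximation (2.2) concrete, the following minimal Python sketch evaluates the GPD survival function and the resulting tail approximation. The function names `gpd_sf` and `tail_approx` and the argument `p_u` (standing for $1 - F(u)$) are illustrative choices, not notation from the text; in practice all three inputs would have to be estimated.

```python
import math


def gpd_sf(x, gamma, sigma=1.0):
    """Survival function 1 - H_{gamma,sigma}(x) of the generalized Pareto distribution."""
    if x < 0:
        return 1.0
    if gamma == 0.0:
        return math.exp(-x / sigma)  # exponential limit for gamma -> 0
    base = 1.0 + gamma * x / sigma
    if base <= 0.0:  # beyond the finite right endpoint (only possible for gamma < 0)
        return 0.0
    return base ** (-1.0 / gamma)


def tail_approx(y, u, p_u, gamma, sigma_u):
    """Approximate 1 - F(y) for y > u by p_u * (1 - H_{gamma,sigma_u}(y - u)), cf. (2.2).

    p_u is an assumed name for 1 - F(u).
    """
    if y <= u:
        raise ValueError("approximation (2.2) is only meant for y > u")
    return p_u * gpd_sf(y - u, gamma, sigma_u)
```

For an exact Pareto tail $1 - F(x) = x^{-2}$, $x \ge 1$, one has $\gamma = 1/2$ and $a(u) = \gamma u$, and the approximation is exact: `tail_approx(10, 4, 4 ** -2, 0.5, 2.0)` reproduces $10^{-2}$ up to rounding. For other distributions the result is only an approximation whose quality depends on $u$.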



The extreme value approach to the analysis of $F_u$ relies on convergence (2.1). Fortunately,
almost all textbook distributions suggested to model claim sizes fulfill this condition, which can be
reformulated as

\[ \lim_{u \uparrow F^{\leftarrow}(1)} \frac{1 - F(u + a(u)x)}{1 - F(u)} = (1 + \gamma x)^{-1/\gamma}, \quad x > 0, \tag{2.3} \]

for some $\gamma \in \mathbb{R}$. It is easily seen that this condition holds if and only if for $\tilde a(t) = a(F^{\leftarrow}(1 - t))$

\[ \lim_{t \downarrow 0} \frac{F^{\leftarrow}(1 - tx) - F^{\leftarrow}(1 - t)}{\tilde a(t)} = \frac{x^{-\gamma} - 1}{\gamma}, \tag{2.4} \]

where the right-hand side is interpreted as $-\log x$ for $\gamma = 0$. (Indeed, (2.4) holds for all $x > 0$.)
The so-called extreme value index $\gamma$ largely determines the tail behavior of $F$. If $\gamma > 0$, then
the loss distribution is unbounded and (2.4) is equivalent to the regular variation of $1 - F$ at $\infty$
and of $F^{\leftarrow}$ at 1:

\[ \lim_{u \to \infty} \frac{1 - F(ux)}{1 - F(u)} = x^{-1/\gamma}, \quad x > 0, \tag{2.5} \]

\[ \lim_{t \downarrow 0} \frac{F^{\leftarrow}(1 - tx)}{F^{\leftarrow}(1 - t)} = x^{-\gamma}, \quad x > 0. \tag{2.6} \]

In this case, both $F$ and $X$ are called heavy-tailed. (Notice, however, that in the literature other
meanings of the term "heavy-tailed distributions" are common, too.) Typical examples are Burr
distributions, loggamma distributions and t distributions. As the survival function $1 - F(x)$ roughly
decays as the power function $x^{-1/\gamma}$, large losses are the more likely the larger $\gamma$ is. In particular,
the loss has infinite expectation if $\gamma > 1$ and it has infinite variance if $\gamma \in (1/2, 1)$.

If the extreme value index is negative, then the loss has bounded support, while for $\gamma = 0$
the right endpoint of the loss distribution can be finite or infinite. For most textbook examples,
including lognormal, gamma and normal distributions, the latter is true.

This article will mainly focus on the case $\gamma > 0$, which is obviously the most troublesome from
an insurer's perspective. We will see that in the statistical analysis it is nevertheless sometimes
better to work with the more general conditions (2.3) and (2.4) instead of the simpler conditions
(2.5) and (2.6), which correspond to the particular choice $a(u) = \gamma u$.



We close this section with a brief outline of the relationship to the limit behavior of maxima
of iid rv's $X_i$, $1 \le i \le n$, with cdf $F$. It can be shown that assumption (2.1) is equivalent to the
convergence of the suitably standardized maxima to the (generalized) extreme value distribution
corresponding to $H$, i.e.

\[ \lim_{n \to \infty} P\Bigl(\frac{\max_{1 \le i \le n} X_i - b_n}{a_n} \le x\Bigr) = G(x) \]

holds for some $a_n > 0$, $b_n \in \mathbb{R}$ and all points of continuity $x$ of the non-degenerate cdf $G$ if and only
if (2.1) is fulfilled with $H = (1 + \log G)^+$. Then it is said that $F$ belongs to the maximum domain of
attraction ($F \in D(G)$ for short), and one can choose $b_n = F^{\leftarrow}(1 - 1/n)$ and $a_n = a(b_n) = \tilde a(1/n)$.
3 Univariate tail risk analysis

To start with a concrete problem, assume that based on observed losses $X_1, \ldots, X_n$ the fair net
premium of a (working) excess-of-loss (XL-) reinsurance with a cover of $c$ in excess of $t$ is to be
estimated, that is, the reinsurer has to pay $\min((X - t)^+, c)$ of all future claims $X$ exceeding $t$.
After a suitable correction for inflation, the random variables $X_i$, $1 \le i \le n$, shall be regarded
as iid with some unknown cdf $F$. If at most a few observations exceed the retention level $t$, then
the net premium per loss $E(\min((X - t)^+, c))$ cannot be directly estimated by the corresponding
mean. Therefore, we assume that $F$ fulfills the basic condition (2.3) for some $\gamma \in \mathbb{R}$, so that we
may approximate the net premium as follows:

\[ \begin{aligned} E\bigl(\min((X - t)^+, c)\bigr) &= \int_t^{t+c} 1 - F(s)\, ds \\ &= \int_{(t-u)/a(u)}^{(t+c-u)/a(u)} \frac{1 - F(u + a(u)x)}{1 - F(u)}\, dx \cdot a(u)(1 - F(u)) \\ &\approx \int_{(t-u)/a(u)}^{(t+c-u)/a(u)} (1 + \gamma x)^{-1/\gamma}\, dx \cdot a(u)(1 - F(u)) \\ &= \frac{a(u)(1 - F(u))}{1 - \gamma} \Bigl[\Bigl(1 + \gamma\frac{t - u}{a(u)}\Bigr)^{1 - 1/\gamma} - \Bigl(1 + \gamma\frac{t + c - u}{a(u)}\Bigr)^{1 - 1/\gamma}\Bigr] \end{aligned} \tag{3.1} \]

for some suitable threshold $u < t$, provided that $1 + \gamma(t + c - u)/a(u) > 0$. If $1 - F(u)$ is sufficiently
large, then it can be estimated by the corresponding empirical probability $1 - F_n(u)$. Hence,
if one replaces the extreme value index $\gamma$ and the scale factor $a(u)$ by some estimators $\hat\gamma_n$ and $\hat a_n(u)$,
respectively, then one obtains a reasonable estimator of the net premium per claim. (To obtain
an estimator for the net premium of the whole XL reinsurance contract, one has to multiply this
expression with some estimator of the expected number of claims.)
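The closed form on the right-hand side of (3.1) is easy to evaluate once values for $\gamma$, $a(u)$ and $1 - F(u)$ are available. The sketch below is only an illustration: the name `xl_net_premium` and the argument `p_u` (for $1 - F(u)$) are assumptions, and the cases $\gamma = 0$ and $\gamma = 1$, where the formula has to be read as a limit, are deliberately excluded.

```python
def xl_net_premium(t, c, u, p_u, gamma, a_u):
    """Closed form of the GPD approximation (3.1) to E(min((X - t)^+, c)).

    p_u is an assumed name for 1 - F(u); a_u stands for the scale factor a(u).
    The sketch covers gamma not in {0, 1} and a threshold u < t only.
    """
    if not u < t or c <= 0.0:
        raise ValueError("need u < t and a positive cover c")
    lower = 1.0 + gamma * (t - u) / a_u
    upper = 1.0 + gamma * (t + c - u) / a_u
    if gamma in (0.0, 1.0) or lower <= 0.0 or upper <= 0.0:
        raise ValueError("parameters outside the range covered by this sketch")
    exponent = 1.0 - 1.0 / gamma
    return a_u * p_u / (1.0 - gamma) * (lower ** exponent - upper ** exponent)
```

As a plausibility check, for the exact Pareto tail $1 - F(x) = x^{-2}$ (so $\gamma = 1/2$, $a(u) = u/2$) the true premium is $1/t - 1/(t + c)$; with $t = c = 10$ and $u = 4$ the call `xl_net_premium(10, 10, 4, 4 ** -2, 0.5, 2.0)` reproduces $0.05$ up to rounding.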

Estimators of $\gamma$ and $a(u)$ that use only exceedances over the threshold $u$ can be motivated
in a similar way. For example, if one can assume that $\gamma$ is strictly positive, then by (2.5) the
conditional distribution of $X/u$ given $X > u$ may be approximated by a Pareto distribution with
Lebesgue density $x^{-(1/\gamma + 1)}/\gamma$, $x > 1$. Ignoring the approximation error and the fact that
the number $N(u)$ of exceedances is also random, we may estimate $\gamma$ by a maximum likelihood
approach to obtain

\[ \hat\gamma_n := \frac{1}{N(u)} \sum_{i=1}^n \log\frac{X_i}{u}\, 1_{(u,\infty)}(X_i). \tag{3.2} \]

If one starts with condition (2.3) in the general case $\gamma \in \mathbb{R}$, then, conditionally on $X_i > u$, the
excesses $X_i - u$ are iid with approximative density $h_\gamma(x/\sigma)/\sigma = (1 + \gamma x/\sigma)^{-(1/\gamma + 1)}/\sigma$
for $1 + \gamma x/\sigma > 0$ with $\sigma := a(u)$. As the resulting approximative likelihood is unbounded for
$\gamma < -1$, a point of maximum of the loglikelihood $-(1/\gamma + 1) \sum_{i=1}^n \log\bigl(1 + \gamma (X_i - u)^+/\sigma\bigr) - N(u) \log \sigma$
on the parameter set $\{(\gamma, \sigma) \mid \gamma > -1,\ \sigma > -\gamma \max_{1 \le i \le n} (X_i - u)^+\}$ can be motivated as an
estimator for $(\gamma, a(u))$.
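The approximative loglikelihood and its parameter set can be written down directly. The following sketch only evaluates the loglikelihood; maximizing it would require a numerical optimizer, which is left out here. The function name `gpd_loglik`, the explicit check `sigma > 0`, and the handling of $\gamma = 0$ by the exponential limit are choices made for this illustration.

```python
import math


def gpd_loglik(gamma, sigma, losses, u):
    """Approximative loglikelihood of the excesses over u in the GPD model.

    Returns -inf outside the parameter set gamma > -1, sigma > -gamma * max excess
    (sigma > 0 is checked as well, an assumption made explicit in this sketch).
    """
    excesses = [x - u for x in losses if x > u]
    if not excesses:
        raise ValueError("no observation exceeds the threshold")
    if sigma <= 0.0 or gamma <= -1.0 or sigma <= -gamma * max(excesses):
        return -math.inf
    n_u = len(excesses)  # number N(u) of exceedances
    if gamma == 0.0:  # exponential limit
        return -sum(excesses) / sigma - n_u * math.log(sigma)
    log_sum = sum(math.log1p(gamma * y / sigma) for y in excesses)
    return -(1.0 / gamma + 1.0) * log_sum - n_u * math.log(sigma)
```

A grid search or a general-purpose optimizer applied to `gpd_loglik` over the admissible parameter set would then yield the estimator described above; which optimizer to use is a practical choice outside the scope of this sketch.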

Of course, several other estimators of the extreme value index and the scale factor have been 
proposed; see, for instance, de Haan and Ferreira (2006), Sections 3 and 4. 

The performance of all these estimators crucially depends on the accuracy of the (generalized)
Pareto approximation in (2.3) (respectively (2.5) in the case $\gamma > 0$) and the choice of the threshold
$u$. Too low a threshold will lead to a large bias, because the GPD approximation is inaccurate for
the smallest exceedances of $u$. On the other hand, if $u$ is chosen too large, then the estimators use
only a very small fraction of the whole sample and thus their variance will be large. In Section 4,
we will discuss methods to deal with the bias-variance tradeoff in greater detail.

Because one has to choose the threshold $u$ depending on the data, it seems natural to use
a large order statistic $u = X_{n-k_n:n}$ (with $X_{i:n}$ denoting the $i$th smallest observation). Then all
estimators under consideration are based on the $k_n + 1$ largest observations and can therefore be
written as functionals of the tail empirical quantile function

\[ Q_n(t) := X_{n-[k_n t]:n}, \quad t \in [0, 1]. \tag{3.3} \]

For example, replacing $u$ with $X_{n-k_n:n}$ in (3.2) yields the well-known Hill estimator

\[ \hat\gamma_n^H := \frac{1}{k_n} \sum_{i=1}^{k_n} \log\frac{X_{n-i+1:n}}{X_{n-k_n:n}} \tag{3.4} \]

(if there are no ties).
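The Hill estimator is short enough to state in code. The following sketch computes it from the $k + 1$ largest observations; the function name `hill_estimator` is illustrative, and positivity of the threshold is assumed so that the logarithms are defined.

```python
import math


def hill_estimator(sample, k):
    """Hill estimator based on the k + 1 largest observations, i.e. (3.2) with u = X_{n-k:n}."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = xs[n - k - 1]  # X_{n-k:n}, the (k+1)-th largest observation
    if threshold <= 0.0:
        raise ValueError("the threshold must be positive")
    # Average log-ratio of the k largest observations to the threshold.
    return sum(math.log(x / threshold) for x in xs[n - k:]) / k
```

For instance, for the four observations $1, e, e^2, e^3$ and $k = 2$ the threshold is $e$ and the two log-ratios are 1 and 2, so the estimate equals 1.5.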

Since the parameters $\gamma$, $a(u)$ and $\tilde a(t)$ are only defined by limit relations like (2.3) and (2.4),
the performance of their estimators must be analyzed in an asymptotic framework. (Indeed, there
are no unique "true" functions $a$ and $\tilde a$, because any function $\bar a$ such that $\bar a(t)/\tilde a(t) \to 1$ as $t \downarrow 0$
also satisfies (2.4).) Because the basic condition (2.4) describes the behavior of $F$ only at its
right endpoint $F^{\leftarrow}(1)$, in the asymptotic setting we must ensure that, while the number of order
statistics used for the statistical tail analysis tends to infinity, all order statistics tend to $F^{\leftarrow}(1)$,
that is, $(k_n)_{n \in \mathbb{N}}$ is a so-called intermediate sequence satisfying

\[ k_n \to \infty, \quad \frac{k_n}{n} \to 0. \tag{3.5} \]

Moreover, $k_n$ should not grow too fast to avoid the aforementioned bias problems due to a poor
GPD approximation. The precise conditions on $k_n$ will be given below in terms of the approximation
error in (2.4), i.e.

\[ R(t, x) := \frac{F^{\leftarrow}(1 - tx) - F^{\leftarrow}(1 - t)}{\tilde a(t)} - \frac{x^{-\gamma} - 1}{\gamma}. \]



As the randomness of all extreme value estimators under consideration is captured by the tail
empirical quantile function, it is natural first to establish a limit theorem for this process and
then to conclude the asymptotic behavior of quite general extreme value estimators (or tests) by
a functional delta method. In what follows, we focus on the case $\gamma > -1/2$ and will often
assume that $\gamma > 0$, which is by far the most relevant case in actuarial applications and helps to
avoid technicalities. The following limit theorem (Drees, 1998a, Theorem 2.1) is the cornerstone
for the subsequent risk analysis.

Theorem 3.1. If $(k_n)_{n \in \mathbb{N}}$ is an intermediate sequence such that for some $\varepsilon > 0$

\[ k_n^{1/2} \sup_{0 < x \le 1 + \varepsilon} x^{\gamma + 1/2} |R(k_n/n, x)| \to 0, \tag{3.6} \]

then for a standard Brownian motion $W$ and all $\varepsilon > 0$ we have

\[ k_n^{1/2} \Bigl(\frac{Q_n(t) - F^{\leftarrow}(1 - k_n/n)}{\tilde a(k_n/n)} - \frac{t^{-\gamma} - 1}{\gamma}\Bigr)_{0 < t \le 1} \longrightarrow \bigl(t^{-(\gamma + 1)} W(t)\bigr)_{0 < t \le 1} \tag{3.7} \]

weakly in the normed vector space $(D_{\gamma,\varepsilon}, \|\cdot\|_{\gamma,\varepsilon})$ of functions $z : (0, 1] \to \mathbb{R}$ which are continuous
from the right with left-hand limits and finite weighted supremum norm

\[ \|z\|_{\gamma,\varepsilon} := \sup_{0 < t \le 1} t^{\gamma + 1/2 + \varepsilon} |z(t)|. \]

It is noteworthy that, under the basic assumption (2.4), there always exist intermediate
sequences $(k_n)_{n \in \mathbb{N}}$ such that (3.6) is fulfilled. Hence (3.6) is not a condition on $F$, but it merely
restricts the speed at which $k_n$ grows with the sample size.

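Now suppose $T : D_{\gamma,\varepsilon} \to \mathbb{R}$ is a scale and shift invariant functional (i.e., $T(az + b) = T(z)$ for
all $a > 0$, $b \in \mathbb{R}$ and $z \in D_{\gamma,\varepsilon}$) such that $T(z_\gamma) = \gamma$ for $z_\gamma(t) := (t^{-\gamma} - 1)/\gamma$, and such that the following
(Hadamard) differentiability condition holds: there exists a signed measure $\nu_{T,\gamma}$ on $(0, 1]$ such that

\[ \frac{T(z_\gamma + \lambda_n y_n) - T(z_\gamma)}{\lambda_n} \to \int y \, d\nu_{T,\gamma} \tag{3.8} \]

for all sequences $\lambda_n \downarrow 0$ and all $y_n, y \in D_{\gamma,\varepsilon}$ satisfying $\|y_n - y\|_{\gamma,\varepsilon} \to 0$. Then one may easily deduce
that

\[ k_n^{1/2} \bigl(T(Q_n) - \gamma\bigr) = k_n^{1/2} \Bigl(T\Bigl(\frac{Q_n - F^{\leftarrow}(1 - k_n/n)}{\tilde a(k_n/n)}\Bigr) - \gamma\Bigr) \longrightarrow \int_{(0,1]} t^{-(\gamma + 1)} W(t) \, \nu_{T,\gamma}(dt) \quad \text{weakly,} \tag{3.9} \]

where the right-hand side is normally distributed with expectation 0 and variance

\[ \sigma_{T,\gamma}^2 := \int_{(0,1]} \int_{(0,1]} (st)^{-(\gamma + 1)} \min(s, t) \, \nu_{T,\gamma}(ds) \, \nu_{T,\gamma}(dt). \tag{3.10} \]

Likewise, the scale factor $\tilde a(k_n/n)$ can be estimated by $S(Q_n)$, where $S : D_{\gamma,\varepsilon} \to \mathbb{R}$ is a scale
equivariant and shift invariant functional (i.e., $S(az + b) = aS(z)$ for all $a > 0$, $b \in \mathbb{R}$ and $z \in D_{\gamma,\varepsilon}$)
such that $S(z_\gamma) = 1$ and for some signed measure $\nu_{S,\gamma}$ on $(0, 1]$

\[ \frac{S(z_\gamma + \lambda_n y_n) - S(z_\gamma)}{\lambda_n} \to \int y \, d\nu_{S,\gamma} \tag{3.11} \]

for all sequences $\lambda_n \downarrow 0$ and $\|y_n - y\|_{\gamma,\varepsilon} \to 0$. Then we conclude

\[ k_n^{1/2} \Bigl(\frac{S(Q_n)}{\tilde a(k_n/n)} - 1\Bigr) \longrightarrow \int_{(0,1]} t^{-(\gamma + 1)} W(t) \, \nu_{S,\gamma}(dt) \quad \text{weakly.} \tag{3.12} \]

(In fact, the joint weak convergence of (3.7), (3.9) and (3.12) holds.)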

As an example, consider the functional $(T(z), S(z))$ defined as a solution $(\gamma, \sigma)$ of the equations

\[ \int_0^1 \frac{1}{1 + \gamma (z(t) - z(1))/\sigma} \, dt = \frac{1}{\gamma + 1}, \qquad \int_0^1 \log\bigl(1 + \gamma (z(t) - z(1))/\sigma\bigr) \, dt = \gamma \]

with $\gamma \ne 0$, or as $(0, \sigma)$ if $\bigl(\int_0^1 z(t) - z(1) \, dt\bigr)^2 = \int_0^1 (z(t) - z(1))^2 \, dt / 2$ and $\sigma = \int_0^1 z(t) - z(1) \, dt$.
Then $(T(Q_n), S(Q_n))$ is the aforementioned maximum likelihood estimator in the approximating
GPD model (or more precisely, a solution of the corresponding likelihood equations). Using the
methodology sketched above, one can prove that under the conditions of Theorem 3.1

\[ k_n^{1/2} \begin{pmatrix} T(Q_n) - \gamma \\ S(Q_n)/\tilde a(k_n/n) - 1 \end{pmatrix} \longrightarrow N(0, \Sigma) \quad \text{weakly with} \quad \Sigma = \begin{pmatrix} (1 + \gamma)^2 & -(1 + \gamma) \\ -(1 + \gamma) & 2 + 2\gamma + \gamma^2 \end{pmatrix}. \]

It can be shown that these estimators have minimal asymptotic variances among all estimators
$T(Q_n)$ and $S(Q_n)$ of the type discussed above. (See Drees (1998a) and Drees et al. (2004) for
details.)

If $\gamma > 0$, then one can always choose $\tilde a(t) = \gamma F^{\leftarrow}(1 - t)$, so that (3.7) reads as

\[ k_n^{1/2} \Bigl(\frac{Q_n(t)}{F^{\leftarrow}(1 - k_n/n)} - t^{-\gamma}\Bigr)_{0 < t \le 1} \longrightarrow \bigl(\gamma t^{-(\gamma + 1)} W(t)\bigr)_{0 < t \le 1} \quad \text{weakly in } D_{\gamma,\varepsilon}. \]

Hence, in this case, we need not require that $T$ is shift invariant, but merely that it is scale
invariant, which allows for a wider class of functionals. A prominent example is the Hill estimator
$T_H(Q_n)$ with $T_H(z) = \int_0^1 \log(z(t)/z(1)) \, dt$, which is scale invariant but not shift invariant. The
Hill estimator is asymptotically normal with asymptotic variance $\gamma^2$ if condition (3.6) is met for
$\tilde a(t) = \gamma F^{\leftarrow}(1 - t)$, which reads as

\[ k_n^{1/2} \sup_{0 < x \le 1 + \varepsilon} x^{\gamma + 1/2} \Bigl|\frac{F^{\leftarrow}(1 - k_n x/n)}{F^{\leftarrow}(1 - k_n/n)} - x^{-\gamma}\Bigr| \to 0. \tag{3.13} \]

Note that for some cdf's $F$ this condition imposes a much more severe restriction on the number
of order statistics used for estimation than condition (3.6) for some other choice of the normalizing
function $\tilde a$. Thus, even if it is known (or assumed) in advance that $\gamma > 0$, it may be advisable to
use the shift invariant ML estimator in the GPD model instead of the Hill estimator, although the
latter has a smaller asymptotic variance.
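The asymptotic normality of the Hill estimator with asymptotic variance $\gamma^2$ suggests a simple plug-in confidence interval, $\hat\gamma \pm z_{1-\alpha/2}\, \hat\gamma / k^{1/2}$. The sketch below implements exactly this; it is an illustration under the assumptions that the observations are iid and that $k$ is small enough for the bias to be negligible in the sense of (3.13). The function name and the plug-in of the estimate for the unknown $\gamma$ in the variance are choices made here.

```python
import math
from statistics import NormalDist


def hill_confidence_interval(gamma_hat, k, level=0.95):
    """Asymptotic two-sided confidence interval for gamma based on the Hill estimate.

    Uses the asymptotic variance gamma^2 / k with gamma replaced by gamma_hat
    (a plug-in step assumed for this sketch); bias is ignored.
    """
    if k < 1 or not 0.0 < level < 1.0 or gamma_hat <= 0.0:
        raise ValueError("need k >= 1, 0 < level < 1 and a positive estimate")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # standard normal quantile
    half_width = z * gamma_hat / math.sqrt(k)
    return gamma_hat - half_width, gamma_hat + half_width
```

If the bias is not negligible, or if the data are serially dependent, the actual coverage of such an interval can differ substantially from its nominal level.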

Example 3.2. Assume that the following expansion of the quantile function holds:

\[ F^{\leftarrow}(1 - t) = d_1 t^{-\gamma} + d_2 + d_3 t^{\rho - \gamma} + o(t^{\rho - \gamma}) \tag{3.14} \]

for some $\gamma, \rho > 0$, $d_1 > 0$ and $d_2, d_3 \ne 0$ with $d_2 + d_3 \ne 0$ if $\rho = \gamma$. Then condition (3.13)
ensuring the asymptotic normality of the Hill estimator is equivalent to $k_n^{1/2} (k_n/n)^{\min(\gamma,\rho)} \to 0$
and hence to $k_n = o\bigl(n^{2\min(\gamma,\rho)/(2\min(\gamma,\rho) + 1)}\bigr)$. In contrast, the ML estimator in the GPD model
is asymptotically normal if condition (3.6) holds, e.g., with the choice $\tilde a(t) = \gamma d_1 t^{-\gamma}$, which is
equivalent to $k_n^{1/2} (k_n/n)^{\rho} \to 0$ and is hence fulfilled for $k_n = o\bigl(n^{2\rho/(2\rho + 1)}\bigr)$. For that reason, in the
case $\rho > \gamma$, the ML estimator may use many more large order statistics than the Hill estimator
before a significant bias shows up.

Remark 3.3. It can be shown that, in the situation of Example 3.2, the shift invariant estimators
$T(Q_n)$ and $S(Q_n)$ are still asymptotically normal if $k_n \sim \lambda n^{2\rho/(2\rho + 1)}$ for some $\lambda > 0$, but then
the limiting normal distribution is no longer centered. See Drees (1998a) or de Haan and Ferreira
(2006), Section 3, for similar results under second order refinements of condition (2.4), which are
generalizations of expansion (3.14).

A different type of estimator for the extreme value index that explicitly uses these second order
conditions is discussed in Section 4.5 of Beirlant et al. (2004). These estimators typically have
a smaller bias in the more restrictive model, but their consistency is not ensured if only condition
(2.3) or (2.5) is assumed.
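The growth restrictions of Example 3.2 can be compared numerically. The following sketch merely evaluates the two exponents $2\min(\gamma,\rho)/(2\min(\gamma,\rho)+1)$ and $2\rho/(2\rho+1)$ appearing there; the function names are illustrative. Note that $k_n$ has to be of smaller order than $n$ raised to these exponents, so the resulting powers of $n$ are boundary orders, not recommended values of $k_n$.

```python
def hill_rate_exponent(gamma, rho):
    """Exponent e such that the Hill estimator requires k_n = o(n^e) in Example 3.2."""
    m = min(gamma, rho)
    return 2.0 * m / (2.0 * m + 1.0)


def ml_rate_exponent(rho):
    """Exponent e such that the GPD ML estimator requires k_n = o(n^e) in Example 3.2."""
    return 2.0 * rho / (2.0 * rho + 1.0)
```

For $\gamma = 1/2$ and $\rho = 1$ the exponents are $1/2$ and $2/3$; with $n = 10^6$ this corresponds to boundary orders of about $10^3$ for the Hill estimator versus about $10^4$ for the ML estimator, which illustrates how many more order statistics the latter may use when $\rho > \gamma$.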
Next we will demonstrate by the example of an estimator for the net premium of the XL-
reinsurance discussed above that the asymptotic normality of a huge class of extreme value
estimators follows by straightforward (though sometimes lengthy) computations. Replacing in (3.1)
the threshold $u$ with $Q_n(1) = X_{n-k_n:n}$ and the unknown parameters by suitable estimators, we
arrive at

\[ \Pi_n(t, c) := \frac{k_n}{n} S(Q_n) \int_{(t - Q_n(1))/S(Q_n)}^{(t + c - Q_n(1))/S(Q_n)} (1 + T(Q_n)x)^{-1/T(Q_n)} \, dx. \]

As explained above, we are interested in the case that at most a few observations exceed the
retention level $t$. To reflect this crucial feature in the asymptotic framework, one must consider a
sequence of retention levels $t = t_n$ which increases with the sample size. More precisely, we assume

\[ \frac{n}{k_n}(1 - F(t_n)) \to 0 \quad \text{and} \quad \frac{1 - F(t_n + c_n)}{1 - F(t_n)} \to \lambda \in (0, 1). \tag{3.15} \]

Despite the quite complex structure of the estimator $\Pi_n(t_n, c_n)$, its asymptotic normality follows
from the joint asymptotic normality of $T(Q_n)$, $S(Q_n)$ and $Q_n(1)$ by rather simple Taylor type
expansions. For simplicity, we focus on the case $\gamma \ge 0$, but an analogous result can be proved by
the same methodology for all $\gamma \in \mathbb{R}$. In the case $\gamma < 0$, though, the asymptotic behavior of $\Pi_n$
also depends on the asymptotics of $S(Q_n)$, while it does not play a role in the following result.

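Corollary 3.4. Assume that $F \in D(G_\gamma)$ for some $\gamma \ge 0$, that $(t_n)_{n \in \mathbb{N}}$ is a sequence of retention
levels such that (3.15) holds, and let $(k_n)_{n \in \mathbb{N}}$ be an intermediate sequence satisfying (3.6),

\[ \sup_{0 < x \le 1} x^{\gamma} |R(k_n/n, x)| = o\bigl(k_n^{-1/2} \tau_n\bigr) \tag{3.16} \]

and

\[ \log\Bigl(\frac{n}{k_n}(1 - F(t_n))\Bigr) = o\bigl(k_n^{1/2}\bigr) \tag{3.17} \]

with

\[ \tau_n := \begin{cases} \bigl|\log\bigl(\frac{n}{k_n}(1 - F(t_n))\bigr)\bigr| / \gamma, & \gamma > 0, \\ \frac{1}{2} \log^2\bigl(\frac{n}{k_n}(1 - F(t_n))\bigr), & \gamma = 0. \end{cases} \]

Then

\[ \frac{n}{k_n} \frac{k_n^{1/2}}{\tau_n \tilde a(k_n/n)} \Bigl(\frac{n}{k_n}(1 - F(t_n))\Bigr)^{\gamma - 1} \Bigl(\Pi_n(t_n, c_n) - \int_{t_n}^{t_n + c_n} 1 - F(s) \, ds\Bigr) \longrightarrow N(0, \sigma_{\Pi,\gamma}^2) \quad \text{weakly} \]

with

\[ \sigma_{\Pi,\gamma}^2 := \Bigl(\frac{1 - \lambda^{1 - \gamma}}{1 - \gamma}\Bigr)^2 \sigma_{T,\gamma}^2. \]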

Remark 3.5. (i) From Corollary 3.4 and its proof, one can easily conclude that

\[ \frac{k_n^{1/2}}{\tau_n} \Bigl(\frac{\Pi_n(t_n, c_n)}{\int_{t_n}^{t_n + c_n} 1 - F(s) \, ds} - 1\Bigr) \longrightarrow N(0, \sigma_{T,\gamma}^2) \quad \text{weakly.} \]

Hence, if $\hat\tau_n$ is a consistent estimator of $\tau_n$ in the sense that $\hat\tau_n/\tau_n \to 1$ in probability, and if
$\sigma_{T,\gamma}$ is a continuous function of $\gamma$, then

\[ \frac{k_n^{1/2}}{\hat\tau_n \sigma_{T,T(Q_n)}} \Bigl(\frac{\int_{t_n}^{t_n + c_n} 1 - F(s) \, ds}{\Pi_n(t_n, c_n)} - 1\Bigr) \longrightarrow N(0, 1) \quad \text{weakly,} \]

from which asymptotic confidence intervals are readily constructed. If $\gamma > 0$, then

\[ \hat\tau_n := \frac{\log(t_n/Q_n(1))}{T^2(Q_n)} \]

is a consistent estimator of $\tau_n$.

(ii) In the situation of Example 3.2, condition (3.16) is a direct consequence of condition (3.6).
In general, though, (3.16) cannot always be fulfilled if the rate of convergence in (2.4) is
particularly slow. Since in the proof of Corollary 3.4 this condition is essentially only needed
to bound the bias term IV, a closer inspection of the proof shows that it can be replaced by a
weaker, but more complex condition on $k_n$ which can always be satisfied.

By the same approach one can construct estimators of all risk measures or insurance premiums
which are smooth functionals of the tail cdf $1 - F(t)$, $t > u$, for some large $u$ or of the tail quantile
function $F^{\leftarrow}(1 - t)$, $t < \eta$, for some small $\eta > 0$. For example, the value at risk $F^{\leftarrow}(1 - \alpha)$ for
small $\alpha$ can be estimated by

\[ Q_n(1) + S(Q_n) \frac{(n\alpha/k_n)^{-T(Q_n)} - 1}{T(Q_n)} \]

(cf. Drees (2003)). Extreme value estimators of reinsurance premiums according to Wang's
premium principle have been examined by Vandewalle and Beirlant (2006) in the case $\gamma > 0$ (without
using the tail empirical quantile function explicitly).
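The value-at-risk estimator above is obtained by plugging estimates into (2.4) with $t = k_n/n$ and $tx = \alpha$. A minimal sketch follows; the function name and argument names are illustrative, `gamma_hat` and `scale_hat` stand for the values $T(Q_n)$ and $S(Q_n)$ of whichever estimators are used, and the case of a zero index estimate is handled by the logarithmic limit.

```python
import math


def extreme_quantile(q1, gamma_hat, scale_hat, k, n, alpha):
    """Estimate F^{<-}(1 - alpha) by Q_n(1) + S * ((n * alpha / k)^(-T) - 1) / T.

    q1 stands for Q_n(1) = X_{n-k:n}; gamma_hat and scale_hat are assumed to be
    the estimates T(Q_n) and S(Q_n) supplied by the caller.
    """
    if not 0.0 < alpha < 1.0 or not 1 <= k < n:
        raise ValueError("need 0 < alpha < 1 and 1 <= k < n")
    ratio = n * alpha / k
    if gamma_hat == 0.0:  # limit of (ratio^(-g) - 1) / g as g -> 0
        return q1 - scale_hat * math.log(ratio)
    return q1 + scale_hat * (ratio ** (-gamma_hat) - 1.0) / gamma_hat
```

As a check with known inputs: for the exact Pareto quantile function $F^{\leftarrow}(1 - t) = t^{-1/2}$ with $n = 1000$, $k = 100$, the values $Q_n(1) = 10^{1/2}$, $\gamma = 1/2$ and scale $\gamma \cdot 10^{1/2}$ give back the true quantile $1000^{1/2}$ at level $\alpha = 0.001$. With estimated inputs the result is of course subject to estimation error.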

An advantage of the approach via the tail empirical quantile function is that, with rather little
effort, one can analyze the asymptotic behavior of a large class of extreme value estimators in
a unified framework, and hence easily compare their performance. Moreover, the same analysis
immediately gives the asymptotic normality of the estimators if one replaces the assumption of
independence of the observations by more general conditions on the serial dependence structure.
Indeed, Drees (2003) proved the convergence of the tail empirical quantile function $Q_n$ towards
a centered Gaussian process for stationary time series which satisfy suitable mixing conditions.
Although all the estimators discussed above can still be used in this more general setting, usually
their estimation error will be larger than for iid data. In an extensive simulation study, Drees (2003)
showed that then the actual coverage probability of confidence intervals for extreme quantiles
constructed on the basis of the theory for iid data can be much smaller than its nominal value.
Therefore, it is important not to use these confidence intervals when analyzing time series of returns
on some investment, which usually exhibit quite a strong serial dependence.



4 Selecting the tail fraction and validating the model 

As explained above, in almost all cases there does not exist a threshold $u$ such that the tail cdf
$F(x)$, $x > u$, is exactly equal to some GPD tail, but the accuracy of the GPD approximation usually
increases with the threshold. Consequently, roughly speaking, the modulus of the bias of any of
the extreme value estimators discussed in Section 3 will be a monotonically decreasing function of
$u$, respectively an increasing function of $k$ if the order statistic $X_{n-k:n}$ is used as the threshold.
(This statement should be taken with a pinch of salt: for very small $k$ the bias sometimes becomes
larger again, but in an asymptotic setting the monotonicity can be made precise for intermediate
sequences $(k_n)_{n \in \mathbb{N}}$.) On the other hand, the variance is an increasing function of $u$ and a decreasing
function of $k$, respectively. Therefore, choosing an "optimal" sample fraction of largest observations
used for the statistical tail analysis involves a bias-variance tradeoff. Note that this selection does
not only depend on the data set (or the underlying distribution), but also on the estimator (or
statistical test) used in the analysis. Moreover, the appropriate balance between bias and variance
may also depend on the purpose of the statistical analysis: in some applications a non-negligible
bias may be unacceptable when calculating an insurance premium, while such a bias may be
admissible if it helps to reduce the variance of an estimator of a risk measure. Thus, for a given
data set, there is no single optimal choice for $u$ or $k$.

This said, widely applicable techniques are needed to select the number of largest order statistics
used in the statistical analysis. The most popular graphical tool is to plot the estimator under
consideration (based on the largest $k + 1$ observations) versus $k$. Typically, the graph will be rather
wiggly for small values of $k$, and it will be more or less monotone for large values of $k$ due to the
increasing modulus of the bias. Hopefully, there is a range in between where the plot is relatively
stable, indicating that the bias is not yet dominating, but the variance has already decreased to
an acceptable level. Drees et al. (2000) showed (for the Hill estimator) that it may be advisable
to plot the estimator versus $\log k/\log n$ (as suggested by C. Starica), because usually this graph
spends a larger portion of time in the neighborhood of the true value.
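The coordinates of both kinds of plot can be computed in a single pass over the largest observations. The sketch below returns, for each $k$, the pair of abscissae $k$ and $\log k/\log n$ together with the Hill estimate; the function name `hill_plot_points` and the optional cap `k_max` are assumptions made for this illustration, and plotting itself is left to whatever tool is at hand.

```python
import math


def hill_plot_points(sample, k_max=None):
    """Return a list of (k, log k / log n, Hill estimate based on k + 1 largest observations)."""
    xs = sorted(sample, reverse=True)  # xs[0] is the largest observation
    n = len(xs)
    if k_max is None:
        k_max = n - 1
    if n < 2 or not 1 <= k_max <= n - 1:
        raise ValueError("need n >= 2 and 1 <= k_max <= n - 1")
    if xs[k_max] <= 0.0:
        raise ValueError("the k_max + 1 largest observations must be positive")
    logs = [math.log(x) for x in xs[: k_max + 1]]
    points = []
    running = 0.0  # sum of the logs of the k largest observations
    for k in range(1, k_max + 1):
        running += logs[k - 1]
        estimate = running / k - logs[k]  # logs[k] is log X_{n-k:n}
        points.append((k, math.log(k) / math.log(n), estimate))
    return points
```

Plotting the third component against the first gives the classical plot versus $k$, and plotting it against the second gives the version on the $\log k/\log n$ scale.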

Figure 1 shows such plots for the Hill estimator calculated for a sample of $n = 1000$ iid
Fréchet rv's with cdf $F(x) = \exp(-x^{-1/\gamma})$ on the left-hand side and for a sample of $n$ rv's with
quantile function $F^{\leftarrow}(1 - t) = (t/|\log t|)^{-\gamma}$ on the right-hand side with $\gamma = 1/2$. The plots for
the Fréchet rv's are quite stable for $k$ around 150 or $\log k/\log n$ about 0.7. In contrast, in the
right-hand plots for the logarithmically disturbed Pareto distribution, after strong fluctuations in
the beginning, the graphs immediately show a clear upward trend, and so no plateau is clearly
visible. This different behavior is caused by the different accuracy of the GPD approximation to
the tail. While the Fréchet distribution satisfies expansion (3.14) with $\rho = 1$ leading to the optimal
rate of convergence when $k_n$ is of the order $n^{2/3}$, in the case of the logarithmically disturbed Pareto
distribution it can be shown that the squared bias dominates the variance if $k_n$ is of larger order
than $\log^2 n$. Thus for the second distribution the increasing bias leads to the clear trend already
for quite small values of $k$.

To give some guidance on how to choose the sample fraction used in the tail analysis in such cases, and 
also to avoid subjective choices whose influence on the estimation accuracy is difficult 
to quantify, fully automatic data-driven selection procedures have been proposed that minimize 
the asymptotic mean squared error of the estimators under consideration. Here we consider three 
different methods to choose the number of largest order statistics such that the (asymptotic) mean 
squared error of the Hill estimator $\hat\gamma_{n,k}$ is minimized. See Section 4.7 of Beirlant et al. (2004) for 
a more extensive list and additional references. 

Danielsson et al. (2001) used a bootstrap approach to minimize the mean squared error (MSE). 
Hill (1990) showed that the standard bootstrap does not work here, because it does not capture the 
bias of the Hill estimator (and other linear statistics) properly, but a suitable bootstrap may yield a 
consistent estimator of the MSE in a restricted model. Instead of trying to minimize the MSE of the 
Hill estimator directly, Danielsson et al. (2001) used the auxiliary statistic $A_{n,k} := (M_{n,k} - 2\hat\gamma_{n,k}^2)^2$ 
with 

\[ M_{n,k} := \frac{1}{k} \sum_{i=1}^{k} \log^2 \frac{X_{n-i+1:n}}{X_{n-k:n}}. \tag{4.1} \]

It can be shown that under a suitable second order condition, which generalizes assumption (3.14), 
a sequence $\bar k_0(n)$ which minimizes $E(A_{n,k})$ and a sequence $k_0(n)$ which minimizes the MSE of the 
Hill estimator $\hat\gamma_{n,k}$ have the same asymptotic behavior up to a multiplicative constant. Starting 
from this fact, Danielsson et al. (2001) developed the following algorithm: 

• For some $\varepsilon \in (0,1/2)$, some $n_1 = O(n^{1-\varepsilon})$ and $n_2 := [n_1^2/n]$, define $\bar k_0^*(n_i)$, $i = 1,2$, as the value which 
minimizes the conditional expectation of $(M^*_{n_i,k_i} - 2(\hat\gamma^*_{n_i,k_i})^2)^2$ given the data $X_1, \ldots, X_n$, where 
$\hat\gamma^*_{n_i,k_i}$ and $M^*_{n_i,k_i}$ are defined as in (3.4) and (4.1), respectively, but with n replaced by $n_i$ and 
$X_i$ replaced by $X_i^*$ independently drawn from $X_1, \ldots, X_n$ (with replacement). Here the 
conditional expectation is minimized over $k_i \in \{[\log n_i], \ldots, [n_i/\log n_i]\}$, say. 

• The asymptotic MSE of the Hill estimator $\hat\gamma_{n,k}$ is then minimal for 

\[ \hat k_n^{boot} := \frac{(\bar k_0^*(n_1))^2}{\bar k_0^*(n_2)} \left( \frac{2\log n_1}{\log \bar k_0^*(n_1)} - 1 \right)^{2(\log \bar k_0^*(n_1)/\log n_1 - 1)}. \]

The performance of the data-driven choice of k crucially depends on the value $n_1$. Using heuristic 
arguments, Danielsson et al. (2001) proposed to select the $n_1$ which minimizes 
$(Q(n_1, \bar k_0^*(n_1)))^2 / Q(n_2, \bar k_0^*(n_2))$, with $Q(n_i, k_i)$ denoting the conditional expectation of 
$(M^*_{n_i,k_i} - 2(\hat\gamma^*_{n_i,k_i})^2)^2$ given the data. 
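
The two bullet points above can be sketched in a few lines. This is only an illustration under stated assumptions: the conditional expectations are approximated by averages over `n_boot` resamples, the subsample size `n1` is supplied by the user (the search over $n_1$ described in the last paragraph is omitted), and all function names are hypothetical.

```python
import math
import random

def hill_and_m(sample, ks):
    """Return {k: (Hill estimate, M_k)} for each k in ks (all k < len(sample))."""
    logs = sorted((math.log(x) for x in sample), reverse=True)
    out, s1, s2, j = {}, 0.0, 0.0, 0
    for k in sorted(ks):
        while j < k:                       # running sums over the k largest logs
            s1 += logs[j]
            s2 += logs[j] ** 2
            j += 1
        base = logs[k]                     # log of the (k+1)-th largest value
        hill = s1 / k - base
        m = s2 / k - 2.0 * base * s1 / k + base ** 2
        out[k] = (hill, m)
    return out

def bootstrap_argmin(sample, n_sub, n_boot, rng):
    """k minimizing the resampling average of (M* - 2 (gamma*)^2)^2 for size n_sub."""
    ks = range(max(2, int(math.log(n_sub))), int(n_sub / math.log(n_sub)) + 1)
    totals = {k: 0.0 for k in ks}
    for _ in range(n_boot):
        resample = [rng.choice(sample) for _ in range(n_sub)]
        for k, (hill, m) in hill_and_m(resample, ks).items():
            totals[k] += (m - 2.0 * hill ** 2) ** 2
    return min(totals, key=totals.get)

def combine(k1, k2, n1):
    """Final formula: k1^2/k2 * (2 log n1 / log k1 - 1)^(2 (log k1 / log n1 - 1))."""
    expo = 2.0 * (math.log(k1) / math.log(n1) - 1.0)
    return k1 ** 2 / k2 * (2.0 * math.log(n1) / math.log(k1) - 1.0) ** expo

def bootstrap_k(sample, n1, n_boot=200, seed=0):
    """Bootstrap choice of k for positive data, rounded and clipped to 1..n-1."""
    rng = random.Random(seed)
    n = len(sample)
    n2 = int(n1 ** 2 / n)
    k1 = bootstrap_argmin(sample, n1, n_boot, rng)
    k2 = bootstrap_argmin(sample, n2, n_boot, rng)
    return min(n - 1, max(1, round(combine(k1, k2, n1))))
```

Both subsample sizes must be large enough for the search range of k to be non-empty, and the data are assumed to be positive.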

Drees and Kaufmann (1998) suggested a sequential procedure that was inspired by the so-called 
Lepskii-method for adaptive bandwidth selection in curve estimation. The basic idea of 
this approach is that too large a difference between two Hill estimators $\hat\gamma_{n,i}$ and $\hat\gamma_{n,k}$ with $i < k$ 
indicates that the latter exhibits a large bias. As the random error of the difference is of the order 
$i^{-1/2}$, an asymptotically optimal choice of the number of order statistics can be determined from 
the smallest k such that $i^{1/2}|\hat\gamma_{n,i} - \hat\gamma_{n,k}|$ exceeds a suitable threshold. More precisely, Drees and 
Kaufmann (1998) proposed the following algorithm: 

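• For some $r_n = o(n^{1/2})$ such that $(\log\log n)^{1/2} = o(r_n)$ let 

\[ \hat k_n(r_n) := \min\Big\{ k \in \{1, \ldots, n\} \;\Big|\; \max_{1 \le i \le k} i^{1/2} |\hat\gamma_{n,i} - \hat\gamma_{n,k}| > r_n \Big\}. \]

• Fix some $\lambda, \xi \in (0,1)$ such that $(\log\log n)^{1/(2\xi)} = o(r_n)$ and calculate a pilot estimate 
$\tilde\gamma_n = \hat\gamma_{n,[2\sqrt{n^+}]}$ with $n^+$ denoting the number of positive observations. Then the asymptotic 
MSE of the Hill estimator $\hat\gamma_{n,k}$ is minimal for 

\[ (2\hat\rho_n + 1)^{-1/\hat\rho_n} \, (2\tilde\gamma_n^2 \hat\rho_n)^{1/(2\hat\rho_n+1)} \left( \frac{\hat k_n(r_n^\xi)}{(\hat k_n(r_n))^\xi} \right)^{1/(1-\xi)} \]

with 

\[ \hat\rho_n := \log \frac{\max_{1 \le i \le [\lambda \hat k_n(r_n^\xi)]} i^{1/2} |\hat\gamma_{n,i} - \hat\gamma_{n,[\lambda \hat k_n(r_n^\xi)]}|}{\max_{1 \le i \le \hat k_n(r_n^\xi)} i^{1/2} |\hat\gamma_{n,i} - \hat\gamma_{n,\hat k_n(r_n^\xi)}|} \Big/ \log\lambda \; - \; \frac{1}{2}. \]

Here the specific values of $r_n$, $\xi$ and (to a lesser extent) $\lambda$ influence the performance of the procedure; 
the authors recommended to choose $r_n = 2.5\,\tilde\gamma_n n^{1/4}$, $\xi = 0.7$ and $\lambda = 0.8$; see Drees and Kaufmann 
(1998) for further comments on the implementation of this algorithm. 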
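
The building blocks of this sequential procedure can be sketched as follows. This is a hedged illustration rather than the authors' implementation: `gammas[k-1]` is assumed to hold the Hill estimate based on k upper order statistics, the maxima are taken over $1 \le i \le k$, and the function names are hypothetical.

```python
import math

def max_deviation(gammas, k):
    """max over 1 <= i <= k of sqrt(i) * |gamma_i - gamma_k|."""
    return max(math.sqrt(i) * abs(gammas[i - 1] - gammas[k - 1])
               for i in range(1, k + 1))

def stopping_time(gammas, r):
    """Smallest k whose maximal weighted deviation exceeds r (None if there is none)."""
    for k in range(1, len(gammas) + 1):
        if max_deviation(gammas, k) > r:
            return k
    return None

def rho_estimate(gammas, k_xi, lam):
    """Second order parameter estimate from the deviations at [lam*k_xi] and k_xi."""
    ratio = max_deviation(gammas, int(lam * k_xi)) / max_deviation(gammas, k_xi)
    return math.log(ratio) / math.log(lam) - 0.5

def optimal_k(rho, gamma_pilot, k_xi, k_r, xi):
    """Combine both stopping times, k_xi for r^xi and k_r for r, into the final k."""
    return ((2.0 * rho + 1.0) ** (-1.0 / rho)
            * (2.0 * gamma_pilot ** 2 * rho) ** (1.0 / (2.0 * rho + 1.0))
            * (k_xi / k_r ** xi) ** (1.0 / (1.0 - xi)))
```

In practice one would call `stopping_time` twice, with thresholds $r_n$ and $r_n^\xi$, and either estimate $\rho$ with `rho_estimate` or fix it in advance. The quadratic running time of `stopping_time` is acceptable for a sketch but would be replaced by an incremental update for large samples.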

In yet another approach, Beirlant et al. (2004), Section 4.7.1 (ii), fitted an extended Pareto 
model with an explicit second order correction term to the data using a maximum likelihood 
estimator. Then they calculated and minimized the MSE of the Hill estimator directly from this 
fit. The resulting estimated optimal number will be denoted by k^^ ■ 

In Figure 1 the resulting estimates for the optimal number of order statistics are indicated by 
vertical lines. While the bootstrap (dashed line) and the sequential approach (solid line) both 
yield reasonable values, the method which uses an explicit model for the second order term leads 
to values that are too large and thus to a considerable bias. Indeed, it has been observed in the literature that 
for moderate sample sizes it is notoriously difficult to estimate second order parameters like $\rho$ 
in expansion (3.14). In contrast, the bootstrap and the sequential approach both use estimates of 
this second order parameter only in an estimate of a multiplicative constant, while their order of 
magnitude does not explicitly depend on such an estimate. Hence they yield reasonable estimates 
even if this second order parameter is fixed (e.g., $\rho = 1$) and misspecified. 

Once the sample fraction of largest order statistics has been chosen, one should check whether 
the distribution of these observations can be well approximated by a (generalized) Pareto distribution. A classical graphical tool for 
such a model validation is the qq-plot. If $F \in D(G_\gamma)$ for some $\gamma > 0$ and k is chosen not too 
large, then using (2.6) one can approximate 

\[ \log \frac{X_{n-i+1:n}}{X_{n-k:n}} \approx \log \frac{F^{\leftarrow}(1 - (i - 1/2)/n)}{F^{\leftarrow}(1 - (k + 1/2)/n)} \approx -\gamma \log \frac{i - 1/2}{k + 1/2}. \]

Hence the points $(-\log((i - 1/2)/(k + 1/2)), \log(X_{n-i+1:n}/X_{n-k:n}))$ should approximately lie on 
the line with slope $\hat\gamma_n = T(Q_n)$ through the origin. To assess whether the observed deviations of 
the points from this line are probably only due to their randomness or whether they indicate that 
the GPD approximation is inaccurate, we can again use Theorem 3.1. 



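Corollary 4.1. Assume that $(k_n)_{n \in \mathbb{N}}$ is an intermediate sequence such that condition (3.13) holds 
for some $\gamma > 0$. Then for all $\varepsilon > 0$ and all scale invariant functionals T on $D_{\gamma,\varepsilon}$ satisfying 
$T(z_\gamma) = \gamma$ and the differentiability condition (3.8), we have 

\[ k_n^{1/2} \Big( \log \frac{Q_n(t)}{Q_n(1)} + T(Q_n) \log t \Big)_{0 < t \le 1} \longrightarrow \Big( \gamma \big( t^{-1} W(t) - W(1) \big) + \int_{(0,1]} s^{-(\gamma+1)} W(s) \, \nu_{T,\gamma}(ds) \cdot \log t \Big)_{0 < t \le 1} \tag{4.2} \]

weakly in $(D_{0,\varepsilon}, \|\cdot\|_{0,\varepsilon})$. Hence, 

\[ P\Big\{ \max_{1 \le i \le k_n} h\Big( \frac{i - 1/2}{k_n + 1/2} \Big) k_n^{1/2} \Big| \log \frac{X_{n-i+1:n}}{X_{n-k_n:n}} + T(Q_n) \log \frac{i - 1/2}{k_n + 1/2} \Big| > c \Big\} \]
\[ \longrightarrow P\Big\{ \sup_{0 < t \le 1} h(t) \Big| \gamma \big( t^{-1} W(t) - W(1) \big) + \int_{(0,1]} s^{-(\gamma+1)} W(s) \, \nu_{T,\gamma}(ds) \cdot \log t \Big| > c \Big\} \tag{4.3} \]

for all continuous functions $h : (0, 1] \to (0, \infty)$ such that $h(t) t^{-(1/2+\varepsilon)} \to 0$ as $t \downarrow 0$ for some $\varepsilon > 0$. 
In particular, for $T = T_H$ (i.e., $T(Q_n)$ equal to the Hill estimator), we obtain 

\[ P\Big\{ \max_{1 \le i \le k_n} h\Big( \frac{i - 1/2}{k_n + 1/2} \Big) k_n^{1/2} \Big| \frac{1}{T_H(Q_n)} \log \frac{X_{n-i+1:n}}{X_{n-k_n:n}} + \log \frac{i - 1/2}{k_n + 1/2} \Big| > c \Big\} \]
\[ \longrightarrow P\Big\{ \sup_{0 < t \le 1} h(t) \Big| t^{-1} W(t) - W(1) + \int_{(0,1]} \big( s^{-1} W(s) - W(1) \big) \, ds \cdot \log t \Big| > c \Big\}. \tag{4.4} \]

Under slightly different conditions, a similar result has been proved by Dietrich et al. (2002) 
for the Hill estimator. 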



Using (4.4) one can turn the Pareto qq-plot into a statistical testing tool with given asymptotic 
size $\alpha$. To this end, for some function h satisfying the conditions of Corollary 4.1, using Monte 
Carlo simulations, one determines a critical value $c_\alpha$ such that the probability on the right-hand 
side of (4.4) equals $\alpha$. Then with probability of about $1 - \alpha$ all points of the Pareto qq-plot 
should lie in the band defined by the graphs of the functions $T_H(Q_n)(-\log t \pm k_n^{-1/2} c_\alpha/h(t))$ if the GPD 
approximation is accurate enough so that the bias of the Hill estimator is negligible. The choice 
of the function h determines in which part of the qq-plot deviations are most easily detected: the 
larger h(t) is, the narrower is the band at that point. Because of the condition $h(t) t^{-(1/2+\varepsilon)} \to 0$ 
as $t \downarrow 0$, the band always widens for small values of t, thus allowing for larger deviations of the 
most extreme points of the qq-plot from the ideal line. 
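
As a minimal sketch of this graphical test, the code below computes the points of the Pareto qq-plot and reports those that fall outside the band around the fitted line. It assumes positive data and a band half-width of $T_H(Q_n)\, c_\alpha / (k^{1/2} h(t))$; the critical value `c_alpha` and the weight function `h` must be supplied by the user, and the function names are illustrative.

```python
import math

def pareto_qq_points(sample, k):
    """Triples (t_i, -log t_i, log(X_{n-i+1:n}/X_{n-k:n})) for i = 1, ..., k."""
    xs = sorted(sample, reverse=True)
    base = math.log(xs[k])                     # log of X_{n-k:n}
    points = []
    for i in range(1, k + 1):
        t = (i - 0.5) / (k + 0.5)
        points.append((t, -math.log(t), math.log(xs[i - 1]) - base))
    return points

def points_outside_band(sample, k, c_alpha, h):
    """qq-plot points whose distance to the Hill line exceeds the band half-width."""
    points = pareto_qq_points(sample, k)
    gamma_hat = sum(y for _, _, y in points) / k   # Hill estimator
    outside = []
    for t, x, y in points:
        half_width = gamma_hat * c_alpha / (math.sqrt(k) * h(t))
        if abs(y - gamma_hat * x) > half_width:
            outside.append((x, y))
    return outside
```

A weight function such as `lambda t: t ** 0.6` satisfies the growth condition near 0 for small $\varepsilon$; an empty return value means that no point of the plot contradicts the GPD approximation at the chosen level.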

It has been suggested to use such tests also to select the tail fraction to be analyzed by increasing 
k until the test rejects the GPD hypothesis (see, e.g., Dietrich et al., 2002, Remark 2). In some 
applications this approach may be problematic if $\alpha$ is chosen as small as is common in testing 
(e.g., $\alpha = 0.05$). Note that the limiting Gaussian process in (4.2) tends to 0 as t tends to 1, but 
that the function h is assumed continuous and hence bounded, so that deviations of points of the 
qq-plot from the ideal line near t = 1 (corresponding to the smallest order statistics taken into 
account) are usually difficult to detect. Hence it may happen that one increases the number k so 
much that the Hill estimator (and other extreme value estimators) are strongly biased before the 
test detects that the last order statistics taken into account are poorly fitted. (The same 
argument also applies if $L_2$-type tests like the one examined by Dietrich et al. (2002) are used.) 

To avoid such effects, one might think of choosing a weight function h that tends to $\infty$ as t 
tends to 1 to compensate for the decrease of the modulus of the limiting Gaussian process. For 
instance, as this process has the variance function $\sigma^2(t) = t^{-1} - 1 - \log^2 t$ if $T = T_H$, one might 
be tempted to use a weight function of the form $h(t) = (t(1-t))^{\varepsilon}/\sigma(t)$ for some small $\varepsilon > 0$. 
Unfortunately, without additional conditions on the smoothness of $F^{\leftarrow}$, convergence (4.4) need 
not hold for such a choice of h. The reason is that under the general condition (2.5) small jumps 
of $F^{\leftarrow}$ (or continuous small but rapid changes) are still possible which lead to an "unusually" 
irregular behavior of the tail empirical quantile function $Q_n$ near 1. If one strengthens condition 
(2.5) to a regularity condition on the quantile density function $(F^{\leftarrow})'(1 - t) = 1/f(F^{\leftarrow}(1 - t))$, 
though, then assertion (4.2) may also be strengthened. It is well known that if F has a Lebesgue 
density f which is monotonically decreasing in a neighborhood of the right endpoint of its support, then by Karamata's theorem 

\[ \eta(t) := \frac{t}{F^{\leftarrow}(1-t) \, f(F^{\leftarrow}(1-t))} - \gamma \longrightarrow 0 \quad \text{as } t \downarrow 0 \]

(cf. de Haan and Ferreira (2006), Proposition B.1.9 11.). If we replace condition (3.13) with a 
condition on $k_n$ in terms of the function $\eta$, then convergence (4.2) can be made more precise in a 
neighborhood of t = 1. 

Corollary 4.2. Assume that $(k_n)_{n \in \mathbb{N}}$ is an intermediate sequence such that for some $\gamma > 0$ 

\[ k_n^{1/2} \sup_{0 < t \le (1+\varepsilon) k_n/n} |\eta(t)| \longrightarrow 0 \tag{4.5} \]

and that the functional T satisfies the conditions specified in Corollary 4.1. Then for all $\varepsilon > 0$ 
and for all continuous functions $h : (0,1) \to (0, \infty)$ such that $h(t) t^{-(1/2+\varepsilon)} \to 0$ as $t \downarrow 0$, and 
$h(t)(1 - t)^{1/2-\varepsilon} \to 0$ as $t \uparrow 1$, one has 

\[ k_n^{1/2} \Big( h(t) \Big( \log \frac{Q_n(t)}{Q_n(1)} + T(Q_n) \log t \Big) 1_{(0, 1 - 1/(2k_n)]}(t) \Big)_{0 < t < 1} \]
\[ \longrightarrow \Big( h(t) \Big( \gamma \big( t^{-1} W(t) - W(1) \big) + \int_{(0,1]} s^{-(\gamma+1)} W(s) \, \nu_{T,\gamma}(ds) \cdot \log t \Big) \Big)_{0 < t < 1} \tag{4.6} \]

weakly with respect to the supremum norm. In particular, convergence (4.4) holds for $T = T_H$. 

In Section 6 this result is applied to construct a "confidence band" for the Hill-qq-plot based 
on claim sizes from health insurances. 

If one uses some estimator of $\gamma$ different from the Hill estimator, then usually the probability 
on the right-hand side of (4.3) is a continuous non-linear function of the unknown extreme value 
index $\gamma$. In that case, one can still construct tests with prescribed asymptotic size $\alpha$ for the null 
hypothesis that (3.13) holds. To this end, one first estimates $\gamma$ consistently by $\hat\gamma_n = T(Q_n)$ and 
then, using Monte Carlo simulations, one determines a critical value $c_\alpha$ such that the right-hand 
side of (4.3) equals $\alpha$ when $\gamma$ is replaced with $\hat\gamma_n$. 

Since the supremum is difficult to simulate, it seems natural to simulate the limiting process 
$Z(t) = \gamma (t^{-1} W(t) - W(1)) + \int_{(0,1]} s^{-(\gamma+1)} W(s) \, \nu_{T,\gamma}(ds) \cdot \log t$ on a fine grid $t_i$, $1 \le i \le m$, and 
then to approximate the supremum by $\max_{1 \le i \le m} h(t_i) |Z(t_i)|$. This can still be computationally 
challenging if one tries to approximate the integral of $s^{-(\gamma+1)} W(s)$ using some quadrature formula, 
because the integrand is unbounded in a neighborhood of zero and a large number of integration 
points may be needed to obtain an accurate approximation. Fortunately, for most estimators, one 
can avoid numerical integration using the fact that conditionally on $W(t_i)$, $1 \le i \le m$, the integrals 
$I_i := \int_{(t_{i-1}, t_i]} s^{-(\gamma+1)} W(s) \, \nu_{T,\gamma}(ds)$, $1 \le i \le m$ (with $t_0 := 0$), are independent normal rv's with mean 

\[ \mu_i := \int_{(t_{i-1}, t_i]} s^{-\gamma} \, \nu_{T,\gamma}(ds) \; \frac{W(t_i) - W(t_{i-1})}{t_i - t_{i-1}} + \int_{(t_{i-1}, t_i]} s^{-(\gamma+1)} \, \nu_{T,\gamma}(ds) \Big( W(t_{i-1}) - \frac{t_{i-1}}{t_i - t_{i-1}} \big( W(t_i) - W(t_{i-1}) \big) \Big) \]

and variance 

\[ \sigma_i^2 := \frac{1}{t_i - t_{i-1}} \int_{(t_{i-1}, t_i]} \int_{(t_{i-1}, t_i]} (st)^{-(\gamma+1)} \big( \min(s,t) - t_{i-1} \big) \big( t_i - \max(s,t) \big) \, \nu_{T,\gamma}(ds) \, \nu_{T,\gamma}(dt). \]

This statement, in turn, follows from the conditional independence of the processes $(W(t))_{t_{i-1} \le t \le t_i}$, 
which have the same conditional distribution as 

\[ W(t_{i-1}) + \frac{t - t_{i-1}}{t_i - t_{i-1}} \big( W(t_i) - W(t_{i-1}) \big) + \tilde W(t - t_{i-1}) - \frac{t - t_{i-1}}{t_i - t_{i-1}} \tilde W(t_i - t_{i-1}) \]

with $\tilde W$ denoting a Brownian motion independent of W. Hence the limiting Gaussian process Z 
can be simulated as follows if integrals of power functions with respect to $\nu_{T,\gamma}$ can be calculated 
analytically: 

(i) Simulate m independent centered normal rv's $\Delta_i$ with variance $t_i - t_{i-1}$, where $t_0 := 0$ and 
$t_m := 1$, and let $W(t_i) := \sum_{j=1}^{i} \Delta_j$. 

(ii) Simulate a normal rv I with mean $\sum_{i=1}^{m} \mu_i$ and variance $\sum_{i=1}^{m} \sigma_i^2$. 

(iii) Then $\big( \gamma (t_i^{-1} W(t_i) - W(t_m)) + I \log t_i \big)_{1 \le i \le m}$ is a (pseudo) realization of $(Z(t_i))_{1 \le i \le m}$. 

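Note that the variance of I does not depend on W. In case of the Hill estimator and equidistant 
design points $t_i = i/m$, one has, for the integral $\int_{(0,1]} s^{-1} W(s) \, ds$ in (4.4), 

\[ \sum_{i=1}^{m} \mu_i = (1 + \log m) \Delta_1 + \sum_{i=2}^{m} \Big( \log \frac{m}{i} + 1 - (i - 1) \log \frac{i}{i-1} \Big) \Delta_i, \]

\[ \sum_{i=1}^{m} \sigma_i^2 = 1 - \frac{1}{m} \sum_{i=1}^{m-1} i (i + 1) \log^2 \frac{i+1}{i}. \]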
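
Steps (i) to (iii) can be sketched for the Hill estimator as follows. The code simulates the limit process in (4.4) on the grid $t_i = i/m$, drawing $\int_{(0,1]} s^{-1} W(s)\,ds$ from its conditional normal distribution and subtracting $W(1)$ afterwards. The function names, the default grid size and the use of an empirical quantile as critical value are assumptions made for illustration.

```python
import math
import random

def hill_integral_moments(m):
    """Coefficients c_i of the conditional mean sum(c_i * Delta_i) and the
    conditional variance of int_0^1 W(s)/s ds for the grid t_i = i/m."""
    coef = [1.0 + math.log(m)]
    coef += [math.log(m / i) + 1.0 - (i - 1) * math.log(i / (i - 1))
             for i in range(2, m + 1)]
    var = 1.0 - sum(i * (i + 1) * math.log((i + 1) / i) ** 2
                    for i in range(1, m)) / m
    return coef, max(var, 0.0)

def simulate_z(m, rng):
    """One pseudo realization of (Z(t_1), ..., Z(t_m)) for the Hill estimator."""
    sd = math.sqrt(1.0 / m)
    deltas = [rng.gauss(0.0, sd) for _ in range(m)]     # Brownian increments
    w, total = [], 0.0
    for d in deltas:
        total += d
        w.append(total)                                 # W(t_i)
    coef, var = hill_integral_moments(m)
    mean = sum(c * d for c, d in zip(coef, deltas))
    integral = rng.gauss(mean, math.sqrt(var))          # int_0^1 W(s)/s ds
    slope = integral - w[-1]                            # subtract W(1)
    return [w[i - 1] * m / i - w[-1] + slope * math.log(i / m)
            for i in range(1, m + 1)]

def critical_value(h, alpha, m=200, n_sim=2000, seed=0):
    """Empirical (1 - alpha) quantile of max_i h(t_i) |Z(t_i)|."""
    rng = random.Random(seed)
    sups = sorted(
        max(h(i / m) * abs(z) for i, z in enumerate(simulate_z(m, rng), start=1))
        for _ in range(n_sim))
    return sups[min(n_sim - 1, math.ceil((1.0 - alpha) * n_sim) - 1)]
```

A useful sanity check is the law of total variance: the variance of the conditional mean plus the conditional variance must equal 2, the variance of $\int_0^1 s^{-1} W(s)\,ds$, for every grid size m.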



Corollaries 4.1 and 4.2 describe tests for the null hypothesis that the left-hand sides of 
(3.13) and (4.5), respectively, are negligible. Likewise, one can devise analogous tests for the 
validity of condition (3.6), but the resulting limiting process is more complicated. 



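Corollary 4.3. Suppose that $(k_n)_{n \in \mathbb{N}}$ is an intermediate sequence satisfying condition (3.6) for 
some $\gamma > 0$, and that $S, T : D_{\gamma,\varepsilon} \to \mathbb{R}$ are scale and location invariant functionals such that 
$S(z_\gamma) = 1$, $T(z_\gamma) = \gamma$ and the differentiability conditions (3.8) and (3.11) are met. Then for all 
$\varepsilon > 0$ 

\[ k_n^{1/2} \Big( \frac{Q_n(t) - Q_n(1)}{S(Q_n)} - \frac{t^{-T(Q_n)} - 1}{T(Q_n)} \Big)_{0 < t \le 1} \]
\[ \longrightarrow \Big( t^{-(\gamma+1)} W(t) - W(1) - \frac{t^{-\gamma} - 1}{\gamma} \int_{(0,1]} s^{-(\gamma+1)} W(s) \, \nu_{S,\gamma}(ds) - \frac{1 - t^{-\gamma}(1 + \gamma \log t)}{\gamma^2} \int_{(0,1]} s^{-(\gamma+1)} W(s) \, \nu_{T,\gamma}(ds) \Big)_{0 < t \le 1} \]

weakly in $(D_{\gamma,\varepsilon}, \|\cdot\|_{\gamma,\varepsilon})$ (with $(1 - t^{-\gamma}(1 + \gamma \log t))/\gamma^2$ interpreted as $(\log^2 t)/2$ for $\gamma = 0$). 

The proof is omitted as the assertion follows readily from Theorem 3.1, (3.10), (3.12) and a 
Taylor expansion of $\gamma \mapsto (t^{-\gamma} - 1)/\gamma$. 

Similarly as above, from Corollary 4.3 one may construct bands around the fitted generalized 
Pareto quantile function $Q_n(1) + S(Q_n)(t^{-T(Q_n)} - 1)/T(Q_n)$ in which all the points 
$((i - 1/2)/(k_n + 1/2), X_{n-i+1:n})$ should lie with probability of about $1 - \alpha$. 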



5 Analysis of the extremal dependence 

In recent years, the analysis of the dependence between the extremes of the components of a 
random vector of risks has attracted much attention. If the random vector describes returns on 
different assets, then it is obviously important to assess the risk of large losses in different assets at 
the same time. However, extremal dependence also matters in the analysis of claim sizes from one 
customer in different lines of business. For example, if both a building and its content are insured 
with the same company, then a fire will often lead to large claims in both lines of business. Likewise, 
some dependence can be expected between the claims in different types of health insurances (e.g. 
inpatient and outpatient cover); cf. Section 6. 

In order not to overload this presentation, we will only sketch the basic theory used in the 
analysis of extremal dependence, but we will rather discuss some pitfalls and problems that may 
arise in applications in more detail. For simplicity, we mainly consider bivariate vectors $(X_1, X_2)$ 
of claim sizes (or risks). 
Analogously to our basic assumption (2.1) in the univariate setting, in classical multivariate 
extreme value theory it is assumed that the conditional distribution of the suitably standardized 
random vector given that its norm exceeds a high threshold converges to a non-degenerate limit as 
the threshold increases. However, as the components of the vector need not be of comparable size, 
the marginal distributions are usually first standardized, e.g., to the standard Pareto distribution: 

\[ Y := (Y_1, Y_2) := \Big( \frac{1}{1 - F_l(X_l)} \Big)_{1 \le l \le 2}. \tag{5.1} \]

Here we assume for simplicity that the marginal cdf's $F_l$, $1 \le l \le 2$, of X are continuous. 

Now fix some norm $\|\cdot\|$ on $\mathbb{R}^2$ and suppose that 

\[ P\Big( \frac{Y}{u} \in \cdot \;\Big|\; \|Y\| > u \Big) \longrightarrow P^Z(\cdot) \quad \text{weakly} \tag{5.2} \]

as $u \to \infty$ for some non-degenerate limit distribution $P^Z$ of a random vector $Z = (Z_1, Z_2)$. In that case, Y and its distribution are said to be 
multivariate regularly varying. 
It can be shown that this condition does not depend on the specific choice of the norm, while 
the exact form of the limit distribution does. For example, if one works with the maximum norm 
then (5.2) is equivalent to 

\[ P\{Y_1 > u y_1 \text{ or } Y_2 > u y_2 \mid Y_1 > u \text{ or } Y_2 > u\} = \frac{1 - F_Y(u y_1, u y_2)}{1 - F_Y(u, u)} \longrightarrow P\{Z_1 > y_1 \text{ or } Z_2 > y_2\} = 1 - H(y_1, y_2) \tag{5.3} \]

for all points $(y_1, y_2) \in [1, \infty)^2$ of continuity of H. 
Applying the (generalized) polar transformation 

\[ y \mapsto \Big( \|y\|, \frac{y}{\|y\|} \Big) \]

one can conclude from (5.2) that 

\[ P\Big( \Big( \frac{\|Y\|}{u}, \frac{Y}{\|Y\|} \Big) \in \cdot \;\Big|\; \|Y\| > u \Big) \longrightarrow P^{(\|Z\|, Z/\|Z\|)}(\cdot) \quad \text{weakly.} \]

Now standard arguments from the theory of regular variation show that the limiting distribution 
must be a product measure with first factor equal to the standard Pareto distribution and the 
second factor being some distribution $\Phi$ on the upper right quadrant $S_+ := \{z \in [0, \infty)^2 \mid \|z\| = 1\}$ 
of the unit "circle" with respect to the norm $\|\cdot\|$. Unlike in the univariate setting, the possible limit 
distributions do not form a parametric family, because the so-called spectral probability measure $\Phi$ 
can be any distribution on $S_+$ satisfying the condition 

\[ \int_{S_+} \omega_1 \, \Phi(d\omega) = \int_{S_+} \omega_2 \, \Phi(d\omega), \tag{5.4} \]

which follows from the fact that all marginal cdf's are standard Pareto. 
Remark 5.1. In the literature, instead of (5.2) often the equivalent assumption 

\[ u \, P\Big\{ \frac{Y}{u} \in \cdot \Big\} \longrightarrow \nu \quad \text{vaguely as } u \to \infty \tag{5.5} \]

is considered, where $\nu$ is some Radon measure on $[0,\infty]^2 \setminus \{(0,0)\}$ (see e.g. Resnick (2007), Section 
6.1). Similarly as before, one can conclude that the measure induced by the polar transformation 
is the product of the measure with Lebesgue density $t \mapsto t^{-2}$ on $(0, \infty)$ and a finite measure $\phi$ on 
$S_+$, the so-called spectral measure. The latter is related to the spectral probability measure via 

\[ \Phi = \frac{\phi}{\phi(S_+)} \quad \text{and} \quad \phi = \frac{\Phi}{\int_{S_+} \omega_1 \, \Phi(d\omega)}. \]

For an arbitrary measurable set $A \subset (0, \infty)^2$ one has 

\[ P\Big( \frac{Y}{u} \in A \;\Big|\; \|Y\| > u \Big) \longrightarrow \int_{S_+} \int_1^{\infty} 1_A(t\omega) \, t^{-2} \, dt \, \Phi(d\omega) \tag{5.6} \]

as $u \to \infty$, provided the set A is continuous with respect to the limit measure (that is, the right-hand 
side equals 0 if A is replaced with its topological boundary). Hence, the probability 
that some future observation X falls into a given extreme set C (e.g., $C = (x_1, \infty) \times (x_2, \infty)$, 
describing the event that the claim sizes in both lines of business exceed a given high threshold) 
can be estimated as follows: 

(i) Estimate the marginal cdf's $F_l$ by $\hat F_l$; for the estimation of the tails, the methods discussed 
in the previous sections can be used. 

(ii) Transform the data $X_i = (X_{i,1}, X_{i,2})$, $1 \le i \le n$, and the extreme set C using the fitted 
marginal cdf's: 

\[ \hat Y_i := \Big( \frac{1}{1 - \hat F_1(X_{i,1})}, \frac{1}{1 - \hat F_2(X_{i,2})} \Big), \qquad \frac{1}{1 - \hat F(C)} := \Big\{ \Big( \frac{1}{1 - \hat F_1(x_1)}, \frac{1}{1 - \hat F_2(x_2)} \Big) \;\Big|\; (x_1, x_2) \in C \Big\}. \]

(iii) Fix some norm and estimate the corresponding spectral probability measure $\Phi$ by $\hat\Phi$ (see 
below). 

(iv) Use (5.6) and the regular variation of $\|Y\|$ to approximate 

\[ P\{X \in C\} \approx P\Big\{ Y \in \frac{1}{1 - \hat F(C)} \Big\} \]
\[ = P\Big( \frac{Y}{ru} \in \frac{1}{ru(1 - \hat F(C))} \;\Big|\; \|Y\| > ru \Big) \cdot \frac{P\{\|Y\| > ru\}}{P\{\|Y\| > u\}} \cdot P\{\|Y\| > u\} \]
\[ \approx \int_{S_+} \int_1^{\infty} 1_{1/(ru(1 - \hat F(C)))}(t\omega) \, t^{-2} \, dt \, \Phi(d\omega) \cdot r^{-1} \cdot P\{\|Y\| > u\} \]
\[ \approx \int_{S_+} \int_1^{\infty} 1_{1/(ru(1 - \hat F(C)))}(t\omega) \, t^{-2} \, dt \, \hat\Phi(d\omega) \cdot r^{-1} \cdot \frac{1}{n} \sum_{i=1}^{n} 1\{\|\hat Y_i\| > u\}. \]

Here we have assumed that $\|y\| > ru$ for all $y \in 1/(1 - \hat F(C))$, while on the other hand 
ru is sufficiently large such that (5.6) (with ru instead of u) yields a good approximation. 
Moreover, on the one hand u must be sufficiently large such that $P\{\|Y\| > ru\}/P\{\|Y\| > u\} \approx r^{-1}$, 
while on the other hand it must be sufficiently small such that $P\{\|Y\| > u\}$ can 
be well estimated by its empirical counterpart. 
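
For a rectangular extreme set, step (iv) reduces to a simple sum. The sketch below is an illustration under several assumptions that are not part of the text above: the points are already standardized to Pareto margins (steps (i) and (ii)), the sum norm is used, the transformed set is $(a_1, \infty) \times (a_2, \infty)$, and the spectral probability measure is estimated by the empirical distribution of the angles $\hat Y_i/\|\hat Y_i\|$ of those points whose norm exceeds u. The function name is hypothetical.

```python
def prob_upper_rectangle(points, a1, a2, u, r):
    """Estimate P{Y_1 > a1, Y_2 > a2} from standardized points (y1, y2).

    For an angle (w1, w2) the inner integral over t^(-2) on the set
    {t >= 1 : t*w1 > a1/(r*u), t*w2 > a2/(r*u)} equals 1/t0, where
    t0 = max(1, a1/(r*u*w1), a2/(r*u*w2)).
    """
    n = len(points)
    total = 0.0
    for y1, y2 in points:
        norm = y1 + y2                      # sum norm of a nonnegative vector
        if norm <= u:
            continue                        # only exceedances carry information
        w1, w2 = y1 / norm, y2 / norm
        if w1 == 0.0 or w2 == 0.0:
            continue                        # a ray along an axis never enters the set
        t0 = max(1.0, a1 / (r * u * w1), a2 / (r * u * w2))
        total += 1.0 / t0
    # Empirical angular measure (1/N_u), factor r^(-1), and N_u/n combine to 1/(r*n).
    return total / (r * n)
```

As long as the transformed set lies outside the ball of radius ru, the factor r cancels, so that the result does not depend on the particular value of r. This mirrors the role of r in the approximation above.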

As the family of all possible spectral (probability) measures is nonparametric (infinite dimensional), 
it is substantially more complicated to estimate the spectral measure $\phi$ (or the spectral probability measure $\Phi$) than to fit the tail of a univariate 
cdf. In the last decade, though, a variety of nonparametric estimators of the spectral measure and 
related functions, that also characterize the extremal dependence structure, have been suggested 
and analyzed; see, for instance, de Haan and Ferreira (2006), Chapter 7, Beirlant et al. (2004), 
Chapter 9, or Einmahl and Segers (2009). These estimators also use marginally transformed 
observations $\hat Y_i$, but, unlike in the program for estimating the probability of extreme events outlined above, 
here one may use fully nonparametric estimators of the marginal cdf's, which amounts to working 
with the coordinatewise ranks of the original observations. In the analysis of the asymptotic 
behavior of these estimators, it is important not to assume that the marginal cdf's are known, 
because usually the transformation of the data using estimated marginal cdf's (rather than the 
true ones) does have a non-negligible influence on the estimation error for $\Phi$; ignoring it may lead 
to a wrong assessment of the estimation accuracy (see, e.g., Einmahl and Segers (2009), p. 2960). 

As an alternative approach, it has been suggested to assume some parametric submodel of 
spectral (probability) measures and then to fit this model to the transformed observations using 
maximum likelihood or a generalized moment method. Sometimes it is even assumed that, for some 
$\Phi$ from this parametric family, in (5.6) equality holds for some sufficiently large u. We consider this 
approach quite problematic, because it will rarely be possible to argue for a specific parametric 
family of spectral measures based on either experience with similar data sets or some "physical" 
reasoning about the process generating the extremal dependence. Instead, the parametric families 
are almost always chosen with mathematical convenience in view, while it is argued that the family 
is sufficiently flexible to capture many different dependence structures. Then, however, one merely 
trades the random estimation error, that can be quantified in the nonparametric framework, for 
the risk of a model misspecification, that can hardly be assessed. Hence the seemingly increased 
estimation accuracy which comes with the parametric approach if the model is correct will often 
be just a chimera, which possibly leads to an assessment of the insured risk which is not prudent 
anymore. 

An even more restrictive modeling approach is related to a reformulation of the multivariate 
regular variation in terms of copulas. Any multivariate cdf F with marginal cdf's $F_l$, $1 \le l \le d$, 
can be represented as $F(x_1, \ldots, x_d) = C(F_1(x_1), \ldots, F_d(x_d))$ where C is a so-called copula, i.e. 
a multivariate cdf with uniform margins. If all $F_l$ are continuous, then C is unique: it is the 
joint cdf of $(F_1(X_1), \ldots, F_d(X_d))$. Thus, the cdf $F_Y$ of Y defined by (5.1) equals $F_Y(y_1, y_2) = 
C(1 - 1/y_1, 1 - 1/y_2)$ and convergence (5.3) is equivalent to 

\[ \lim_{t \downarrow 0} \frac{1 - C(1 - t y_1, 1 - t y_2)}{1 - C(1 - t, 1 - t)} = 1 - H(y_1^{-1}, y_2^{-1}) \]

for all points $(y_1^{-1}, y_2^{-1})$ of continuity of H. Hence the parametric approach outlined above boils 
down to assuming that, on a small neighborhood of the point (1,1), the copula C can be well 
approximated by a parametric model. 

Because the estimation of a general copula is essentially as difficult as the estimation of a general 
multivariate cdf (in particular it is also plagued by the well-known "curse of dimensionality" ) , it 
has been suggested to assume parametric models for the whole copula C. This approach, however, 
does not only introduce a modeling error that is difficult to quantify, but it contradicts the general 
philosophy of extreme value theory to "let the tails speak for themselves". In contrast, while 
the estimation error of the aforementioned nonparametric estimators of the extremal dependence 
structure can be quite large (in particular in higher dimensions), it can be well assessed under weak 
model assumptions and thus it can be taken into account by the risk manager. For that reason, 
with the rare exception of those situations when there are convincing arguments that a particular 
family of copulas contains the true one (and not just a crude approximation to it), one should 
take the utmost care in analyzing the tail risk using parametric copulas, in particular in actuarial 
applications where prudence is a time-honored principle. (In this context, the interested reader is 
advised to study the article by Mikosch (2006) and the pertaining discussion for an enlightening 
and entertaining argument about the pros and cons of copula modeling.) 

While the program sketched above in four steps will often yield a reasonable assessment of the 
risk that a future observation falls into some given extreme set if some nonparametric estimator of 
$\Phi$ is used, there is one important case in which the result can be quite misleading. Suppose that 
large values of one component of the transformed vector Y of risks do not usually coincide with 
large values of the other component, or more precisely that 

\[ P(Y_2 > u \mid Y_1 > u) \longrightarrow 0 \tag{5.7} \]

as $u \to \infty$. In that case, $X_1$ and $X_2$ (or $Y_1$ and $Y_2$) are said to be asymptotically independent. 
Straightforward calculations show that then the limit measure $\nu$ in (5.5) has no mass on $(0,\infty]^2$, 
i.e., it is concentrated on the axes. Hence, the spectral measure $\phi$ and the spectral probability measure $\Phi$ have mass only in the points (0, 1) and (1, 0) 
if one uses one of the usual p-norms on $\mathbb{R}^2$ with $p \in [1, \infty]$. (Indeed, because of the normalization 
constraint (5.4), $\Phi$ must be the uniform distribution on {(0, 1), (1, 0)}.) But then for all sets A 
that do not intersect with any of the axes, the limit in (5.6) is 0, which usually is too crude an 
approximation for the left-hand side. 

Note that asymptotic independence is a property of the copula of X. Many popular parametric 
families of copulas allow for asymptotic independence for suitable parameter values. A thorough 
analysis of the tail behavior of so-called Archimedean copulas both in the case of asymptotic 
independence and of asymptotic dependence can be found in Charpentier and Segers (2009). 

To obtain more useful approximations, one has to specify the rate at which $P(Y_2 > u \mid Y_1 > u)$ 
tends to 0. More precisely, one assumes that for some nontrivial function d 

\[ \frac{P\{Y_1 > u y_1, Y_2 > u y_2\}}{P\{Y_1 > u, Y_2 > u\}} \longrightarrow d(y_1, y_2) \tag{5.8} \]

as $u \to \infty$ uniformly for all points $(y_1, y_2)$ with $\max(y_1, y_2) = 1$. As an immediate consequence, 
one obtains the regular variation of the function $u \mapsto P\{Y_1 > u, Y_2 > u\} = P\{\min(Y_1, Y_2) > u\}$: 

\[ \frac{P\{\min(Y_1, Y_2) > ux\}}{P\{\min(Y_1, Y_2) > u\}} \longrightarrow x^{-1/\eta} \tag{5.9} \]

as $u \to \infty$ for all $x > 0$ and some $\eta \in (0, 1]$ which is called the coefficient of tail dependence. 
Moreover, the limiting function d is homogeneous of order $-1/\eta$. Since $Y_1$ and $Y_2$ are asymptotically 
independent whenever $P\{Y_1 > u, Y_2 > u\} = o(u^{-1})$, in particular one has asymptotic independence 
if $\eta < 1$. If $Y_1$ and $Y_2$ are exactly independent, then $\eta = 1/2$, while, roughly speaking, values of $\eta$ 
between 1/2 and 1 indicate a positive, but asymptotically vanishing dependence between the large 
values of $Y_1$ and $Y_2$. A slight modification of this model was first suggested by Ledford and Tawn 
(1996, 1997). 

To construct estimators of the coefficient of tail dependence, let $\hat Y_{i,l}^{(n)} := (n + 1)/(n + 1 - R_{i,l}^{(n)})$, 
$1 \le i \le n$, $1 \le l \le 2$, with $R_{i,l}^{(n)}$ denoting the rank of $X_{i,l}$ among $X_{1,l}, \ldots, X_{n,l}$. Hence $\hat Y_{i,l}^{(n)} = 
1/(1 - \hat F_l(X_{i,l}))$ where $\hat F_l$ is essentially the empirical cdf of $X_{1,l}, \ldots, X_{n,l}$ with a minor modification 
to avoid division by 0. In view of (5.9), the rv's 

\[ T_i^{(n)} := \min\big( \hat Y_{i,1}^{(n)}, \hat Y_{i,2}^{(n)} \big) \tag{5.10} \]

have approximately a Pareto tail with extreme value index $\eta$, that can be estimated by one of the 
usual estimators discussed in Section 3 applied to $T_i^{(n)}$, $1 \le i \le n$. Draisma et al. (2004) proved an 
analog to Theorem 3.1 for the tail empirical quantile function pertaining to these rv's. It turned 
out that in case of asymptotic independence one obtains the same limit as in Theorem 3.1, so 
that also all the results on the extreme value estimators discussed in Section 3 carry over to the 
present situation (although the rv's $T_i^{(n)}$ are not exactly independent). If the components are not 
asymptotically independent, one can still conclude the asymptotic normality of the estimators of 
$\eta$, but the formulas for the asymptotic variance are more complicated and depend on the positive 
limit of $P(Y_2 > u \mid Y_1 > u)$. 
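
A minimal sketch of this construction, assuming that there are no ties within a component and that the Hill estimator is the estimator of choice, could look as follows; the function names are illustrative.

```python
import math

def pareto_ranks(values):
    """Rank transform (n+1)/(n+1-R_i), where R_i is the rank of values[i]."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        out[idx] = (n + 1) / (n + 1 - rank)
    return out

def tail_dependence_hill(x1, x2, k):
    """Hill estimate of the coefficient of tail dependence from k upper order
    statistics of T_i = min of the two rank-transformed components."""
    y1, y2 = pareto_ranks(x1), pareto_ranks(x2)
    t = sorted((min(a, b) for a, b in zip(y1, y2)), reverse=True)
    return sum(math.log(t[i] / t[k]) for i in range(k)) / k
```

For perfectly comonotone components the estimate is close to 1, while strongly negatively associated components give values well below 1/2. As in the univariate case, the result should be inspected as a function of k.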

Drees and Müller (2008) proposed the estimator 

\[ \hat d(y_1, y_2) := \frac{1}{k} \sum_{i=1}^{n} 1\big\{ \hat Y_{i,1}^{(n)} > T_{n-k:n}^{(n)} y_1, \; \hat Y_{i,2}^{(n)} > T_{n-k:n}^{(n)} y_2 \big\} \tag{5.11} \]

of the limiting function $d(y_1, y_2)$ in (5.8) and proved uniform convergence of the suitably 
standardized estimation error $\hat d - d$ towards a centered Gaussian process under suitable smoothness 
conditions on d. Moreover, they derived statistical tests and a graphical tool to validate the model 
assumption (5.8). Finally, the theory was applied to two well-known bivariate data sets of claim 
sizes, the first describing losses to buildings and losses to their content in Danish fire insurances, 
while the second was taken from the Society of Actuaries Group Medical Insurance Large Claims 
Database (cf. Section 6). In both cases it seems very likely that condition (5.8) holds with a 
coefficient of tail dependence $\eta$ less than 1. As the point estimates of $\eta$ were larger than 1/2, the 
claim sizes in the different lines of business are probably asymptotically independent, though with 
a non-negligible positive dependence for finite levels. Applying classical bivariate extreme value 
theory in such cases will usually lead to a wrong assessment of the true risk insured. 

So far, the extremal dependence structure has been analyzed in terms of the joint distribution 
of the observations after a standardization of the marginals to some fixed distribution (standard 
Pareto or uniform). Consequently both coordinates of the random vector have been treated 
symmetrically. As an alternative, in recent years the so-called conditional extreme value (CEV) 
approach has been considered, where the asymptotic behavior of one component (after a suitable 
linear normalization) given that the other component is large is investigated. For example, Abdous 
et al. (2005) and Abdous et al. (2008) considered the limit behavior of 

\[ P(X_2 \le a(x_1) x_2 + b(x_1) \mid X_1 > x_1) \tag{5.12} \]

as $x_1 \to \infty$ for elliptically distributed vectors $(X_1, X_2)$, and Fougères and Soulier (2010) determined 
such limit distributions for more general bivariate distributions in polar representation. Das and 
Resnick (2011) examined possible limits in (5.12) in a general framework of regular variation on 
cones and discussed the relationship to the approach by Ledford and Tawn. In the special case 
$a(x_1) = 1$ and $b(x_1) = 0$ (coined standard regular varying case by Das and Resnick), the CEV 
approach facilitates the analysis of the extremal dependence between large claim sizes in absolute 
terms rather than just a comparison of the behavior of the conditional distribution of one claim size 
given that the other is large relative to its unconditional distribution as in the approach by Ledford 
and Tawn. Unfortunately, in most applications one needs different normalizing functions a and b 
to obtain a non-degenerate limit of (5.12). 

It is worth mentioning that the methods outlined in this section are only useful to analyze the 
dependence between extreme claim sizes (or losses) observed in a small number of different lines of 
insurances (usually sold to the same customer). From an economic point of view, the dependence 
between different risks insured in the same line of business is often more important. For example, 
in property insurance a large storm will usually result in many claims from customers living in the 
same area. With the present state of the art, extreme value theory has little to offer to analyze the 
extremal dependence in such situations. Instead, approaches using expert knowledge (e.g., from 
meteorology) remain the methods of choice. 



6 Analyzing Large Claim Sizes in Health Insurance 

In the years 1991 and 1992 the Society of Actuaries collected large claim sizes (totalling $25,000 or 
more) in US health insurances. The resulting large claim size database is available at the website 
http://www.soa.org. For each claimant, hospital charges and other charges were recorded in each
year together with the type of the health insurance plan and the status of the claimant (employee 
or dependent), among other information. See Grazier and G'Sell Associates (1997) for a detailed 
description of the data set (see also Cebrian et al. (2003) for a statistical analysis of the total 
charges). 

A closer inspection of the data reveals that the structure of the claim sizes depends on the status
of the claimant and the type of the health insurance plan. For example, the large non-hospital 
costs were significantly larger for HMO (health maintenance organization), POS (point of service) 
and indemnity plans than for PPO (preferred provider organization), EPO (exclusive provider 
organization), comprehensive and other indemnity plans as well as for those records for which the 
type of plan was unknown. Therefore, as an example here we analyze the claims for the second 
group of health plans in the year 1992 when the claimant had the status 'dependent'. 

The sampling scheme that only those claims with total costs of at least $25,000 were recorded 
introduces an artificial negative dependence between both components: if the hospital charges were 
small, say less than $5,000, then the other charges must be larger than $20,000, and vice versa. 
For that reason, here we only consider those records for which both types of charges were at least
$25,000, leading to a sample of size n = 1959. (We discuss the consequences of this choice at the 
end of the section.) 
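The selection effect described above can be reproduced on synthetic data. The following is a minimal sketch, not an analysis of the SOA data: it assumes two independent exponential charges with a hypothetical mean of $10,000 and keeps only records whose total is at least $25,000. The names `simulate_truncated_claims` and `pearson` are illustrative helpers.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_truncated_claims(n=20000, mean=10_000.0, cutoff=25_000.0, seed=1):
    """Draw independent charges and keep records with total >= cutoff.

    The exponential model and all parameter values are assumptions made
    purely for illustration.
    """
    rng = random.Random(seed)
    hospital = [rng.expovariate(1.0 / mean) for _ in range(n)]
    other = [rng.expovariate(1.0 / mean) for _ in range(n)]
    kept = [(h, o) for h, o in zip(hospital, other) if h + o >= cutoff]
    return hospital, other, kept


hospital, other, kept = simulate_truncated_claims()
corr_all = pearson(hospital, other)
corr_kept = pearson([h for h, _ in kept], [o for _, o in kept])
print(round(corr_all, 3), round(corr_kept, 3))
```

Although the two charges are generated independently, the correlation in the retained records is clearly negative, which is the artifact that motivates restricting the analysis to records where both charges are large.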

First we fit Pareto distributions to the marginal tails. Figure 2 shows the maximum likelihood
estimator (in the GPD model) for the extreme value index of the hospital charges (solid line) and
the Hill estimator (dashed line) as a function of k (i.e., the number of largest observations used
for estimation reduced by 1). Obviously, the graph of the ML estimator is much more stable than
that of the Hill estimator. While for the former it seems reasonable to use at least 1400 largest
observations, the Hill plot increases more or less monotonically for k > 300, say. Moreover, the
data-driven procedures to select an optimal number of order statistics yield very small values of k
for the Hill estimator (38 for the bootstrap procedure and 26 for the sequential procedure).



As explained in Example 3.2, this qualitatively different behavior may be due to a constant
term in the GPD approximation of the tail that does not influence the shift-invariant ML estimator
but leads to a large bias of the Hill estimator if k is chosen too large. In fact, the ML estimator
fits a shifted Pareto distribution with location parameter of about \(-3 \cdot 10^5\) for a wide range of
k-values. If one adds $300,000 to each of the hospital charges, then the Hill plot (displayed by the
dash-dotted line in Figure 2) becomes very flat, so that at first glance the Hill plot suggests that
one may use almost all observations to estimate \(\gamma\). So apparently for this data set the instability of
the Hill estimator is indeed largely due to its sensitivity to shifts. For the shifted data, the bootstrap
and the sequential procedure sketched in Section 4 suggest using the 143 and 134 largest order
statistics, respectively, resulting in a Hill estimate of about 0.22.
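The sensitivity of the Hill estimator to a location shift can be illustrated on simulated data. The sketch below assumes a shifted Pareto sample with a hypothetical extreme value index of 0.25 and a shift of 300,000, so that adding the shift back yields an exact Pareto sample. It is not fitted to the hospital charges, and `hill` and `shifted_pareto_sample` are illustrative names.

```python
import math
import random


def hill(data, k):
    """Hill estimator based on the k + 1 largest observations."""
    top = sorted(data, reverse=True)[: k + 1]
    threshold = top[k]
    return sum(math.log(x / threshold) for x in top[:k]) / k


def shifted_pareto_sample(n, gamma, scale, seed=0):
    """Sample scale * (U**(-gamma) - 1); adding `scale` gives an exact Pareto.

    The model and its parameters are assumptions for illustration only.
    """
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], which avoids division by zero.
    return [scale * ((1.0 - rng.random()) ** (-gamma) - 1.0) for _ in range(n)]


gamma, scale = 0.25, 300_000.0
claims = shifted_pareto_sample(2000, gamma, scale)
shifted_claims = [x + scale for x in claims]
for k in (50, 200, 1000):
    print(k, round(hill(claims, k), 3), round(hill(shifted_claims, k), 3))
```

For large k the Hill estimate of the raw sample drifts far above the true index, while the estimate for the shifted sample stays close to it, which mirrors the flat Hill plot observed after adding $300,000.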

Figure 3 shows the qq-plot for the shifted data together with the line with slope equal to
the Hill estimate ^^l. 0.286 for k = 133. Moreover, the functions — 7^^],(logt it Ca{t~^ — 1 — 

(logt)^)^^^(t(l — t))"^/^'^) are displayed as dashed lines. Here, Ca = 2.78 was calculated by Monte 
Carlo simulations as described in Section 4 such that the probability that all points of the qq-plot
lie between these graphs is about 1 — a = 0.95. (Strictly speaking, the probability is probably a 
bit higher, because the band does not take into account the fact that the shift has been chosen 
depending on the data to improve on the Pareto fit.) Indeed, the fit is reasonably good and all 
points lie in the band bordered by these functions. However, for the most extreme points the qq- 
plot flattens out, indicating that perhaps the fitted Pareto tail slightly overstates the actual risk, 
which may be desirable for a prudent risk assessment. (If one uses the largest 1900 observations as 
suggested by a superficial inspection of the Hill plot, then the Hill estimate is not changed much 
and still all points of the qq-plot lie within the confidence band.) 

Figure 4 displays the ML estimator and the Hill estimator for the second component describing
the other costs. These plots are less stable than the ones for the hospital charges. For both
estimators it seems advisable not to use many more than 350 largest observations to fit
the tail; the bootstrap and the sequential procedure suggest using merely 125 and 90
largest order statistics, respectively, for the Hill estimator. With k = 125 one obtains the Hill
estimate \(\hat\gamma^{(2)} \approx 0.495\) and a 95%-confidence interval of about [0.41, 0.58]. Hence, apparently
the other charges are significantly heavier tailed than the hospital charges.
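The reported interval can be checked with a short calculation. The sketch assumes the usual normal approximation for the Hill estimator, with asymptotic standard deviation equal to the estimate divided by the square root of k, and assumes that the estimate 0.495 is based on k = 125 order statistics. The helper name `hill_confidence_interval` is illustrative.

```python
import math


def hill_confidence_interval(gamma_hat, k, z=1.96):
    """Two-sided interval gamma_hat * (1 -/+ z / sqrt(k)).

    Assumes asymptotic normality of the Hill estimator with variance
    gamma**2 / k and negligible bias.
    """
    half_width = z * gamma_hat / math.sqrt(k)
    return gamma_hat - half_width, gamma_hat + half_width


lower, upper = hill_confidence_interval(0.495, 125)
print(round(lower, 2), round(upper, 2))
```

Under these assumptions the result agrees with the interval of about [0.41, 0.58] stated above.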

The ML estimator and the Hill estimator of the coefficient \(\eta\) of tail dependence between both
types of costs (based on the rv's \(T_i^{(n)}\) defined by (5.10)) are shown in Figure 5. Here perhaps up to
600 largest order statistics of the \(T_i^{(n)}\) can be used for the ML estimator. Note that the mathematical
theory for the data-driven procedures of choosing k has only been developed for the Hill estimator
based on iid data. Hence, strictly speaking, it is not applicable here, but the aforementioned
result by Draisma et al. (2004) suggests that in the present situation the sequential estimator, which
yields 318 order statistics, has the same asymptotic behavior as for iid data. The resulting Hill estimate
0.63 hints at a rather weak, asymptotically vanishing, but non-negligible dependence, because the
pertaining confidence interval [0.52, 0.74] contains neither 1 nor 1/2.
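A Hill-type estimate of the coefficient of tail dependence can be sketched as follows. Definition (5.10) is not reproduced in this section, so the code assumes the common rank-based construction: each margin is standardized to approximately standard Pareto via (n + 1)/(n + 1 - rank), and the componentwise minimum is taken. All function names are illustrative.

```python
import math
import random


def ranks(values):
    """Ranks 1..n (1 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def min_pareto_scores(x1, x2):
    """Minimum of the two rank-standardized components (assumed form of T)."""
    n = len(x1)
    r1, r2 = ranks(x1), ranks(x2)
    return [min((n + 1) / (n + 1 - a), (n + 1) / (n + 1 - b))
            for a, b in zip(r1, r2)]


def hill(data, k):
    """Hill estimator based on the k + 1 largest observations."""
    top = sorted(data, reverse=True)[: k + 1]
    return sum(math.log(x / top[k]) for x in top[:k]) / k


rng = random.Random(2)
n, k = 5000, 200
x1 = [rng.random() for _ in range(n)]
x2 = [rng.random() for _ in range(n)]
eta_independent = hill(min_pareto_scores(x1, x2), k)
eta_comonotone = hill(min_pareto_scores(x1, x1), k)
print(round(eta_independent, 2), round(eta_comonotone, 2))
```

For independent components the estimate is close to 1/2, and for identical components it is close to 1. An estimate such as 0.63 lies between these two benchmark cases.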

Finally, we consider the estimate \(\hat d(y_1, y_2)\) defined in (5.11). Figure 6 depicts the estimates

\[ x \mapsto \begin{cases} \hat d(1/x, 1), & 0 < x \le 1, \\ \hat d(1, 1/(2-x)), & 1 < x < 2. \end{cases} \]

From these values and the estimate \(\hat\eta_n\) one can calculate estimates of \(d(y_1, y_2)\) for all values \(y_1, y_2 > 0\)
because of the homogeneity of \(d\) of order \(-1/\eta\):

\[ d(y_1, y_2) = \bigl(\min(y_1, y_2)\bigr)^{-1/\eta}\, d\Bigl(\frac{y_1}{\min(y_1, y_2)}, \frac{y_2}{\min(y_1, y_2)}\Bigr). \]



In view of (5.8), one may approximate

\[ d\Bigl(\frac{1}{x}, 1\Bigr) \approx P\Bigl(Y_{i,1} > \frac{u}{x} \Bigm| Y_{i,1} > u,\ Y_{i,2} > u\Bigr)
 = P\Bigl(X_{i,1} > F_1^{\leftarrow}\Bigl(1 - \frac{x}{u}\Bigr) \Bigm| X_{i,1} > F_1^{\leftarrow}\Bigl(1 - \frac{1}{u}\Bigr),\ X_{i,2} > F_2^{\leftarrow}\Bigl(1 - \frac{1}{u}\Bigr)\Bigr) \]

for large u. Observe that in Figure 6 the estimate of this probability is just slightly larger than
\(x = P(Y_{i,1} > u/x \mid Y_{i,1} > u) = P(X_{i,1} > F_1^{\leftarrow}(1 - x/u) \mid X_{i,1} > F_1^{\leftarrow}(1 - 1/u))\). Hence, given
that the hospital charge exceeds a high threshold, the fact that also the other charges exceed
an analogously high threshold does not alter the conditional distribution of the hospital charges
much. The same observation can be made with the roles of hospital charges and other charges
interchanged. This property should not be confused with the asymptotic independence condition
(5.7) in which one conditions only on the event that one component is large. Indeed, it can be
shown that to each \(\eta \in (0, 1)\) and each function \(g : [0, 2] \to [0, 1]\) that is increasing on [0, 1] and
decreasing on [1, 2] with \(g(0) = g(2) = 0\) and \(g(1) = 1\), one can find a probability distribution
such that (5.8) holds with

\[ d(y_1, y_2) = \begin{cases} y_2^{-1/\eta}\, g(y_2/y_1), & y_1 \ge y_2, \\ y_1^{-1/\eta}\, g(2 - y_1/y_2), & y_1 < y_2. \end{cases} \]

Hence the function whose estimate is shown in Figure 6 can be combined with any value of the
coefficient of tail dependence in (0, 1) to obtain a limiting function d in (5.8). (The converse result
that the function d can be represented in such a way is an easy consequence of its homogeneity;
cf. e.g. Charpentier and Juri, 2006, Remark 3.4.)
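The way a profile function on [0, 2] and a coefficient of tail dependence together determine a homogeneous function d can be sketched in code. The construction below assumes that d is homogeneous of order -1/η and that the profile satisfies g(x) = d(1/x, 1) on (0, 1] and g(x) = d(1, 1/(2 - x)) on (1, 2). The tent-shaped profile and the value η = 0.63 are chosen only for illustration, and `d_from_profile` is a hypothetical helper name.

```python
import math


def d_from_profile(y1, y2, eta, g):
    """Evaluate d(y1, y2) from a profile g on [0, 2] and the coefficient eta.

    Assumes homogeneity of order -1/eta, i.e.
    d(c*y1, c*y2) = c**(-1/eta) * d(y1, y2) for all c > 0.
    """
    if y1 <= 0 or y2 <= 0:
        raise ValueError("both arguments must be positive")
    if y1 >= y2:
        return y2 ** (-1.0 / eta) * g(y2 / y1)
    return y1 ** (-1.0 / eta) * g(2.0 - y1 / y2)


def tent(x):
    """Illustrative profile: increasing on [0, 1], decreasing on [1, 2]."""
    return x if x <= 1.0 else 2.0 - x


eta = 0.63
base = d_from_profile(2.0, 5.0, eta, tent)
scaled = d_from_profile(6.0, 15.0, eta, tent)
print(d_from_profile(1.0, 1.0, eta, tent), d_from_profile(4.0, 1.0, eta, tent))
print(math.isclose(scaled, 3.0 ** (-1.0 / eta) * base))
```

The same profile can be paired with any other value of η in (0, 1); only the scaling along rays through the origin changes.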

As we have considered only those claims for which both components are at least $25,000, in
fact we have analyzed the conditional distribution of the claim sizes given that both components
are at least $25,000. If instead we use all records for which at least one of the components exceeds
$25,000, then a different conditional distribution is analyzed. Indeed, for this larger data set
one obtains higher estimates of the coefficient of tail dependence, indicating a stronger extremal
dependence between both types of charges. At first glance, this fact seems counterintuitive, because
in assumption (5.9) only probabilities of events occur in which both components are large. Notice,
however, that the \(Y_{i,j}\) have been calculated by transforming the \(X_{i,j}\) using the marginal cdf \(F_j\), and
this marginal distribution is different for the two conditional settings described above. For that
reason, in contrast to the first impression, the parameter \(\eta\) and likewise the limiting function d also
depend on the stochastic behavior of the vector on the regions where just one of the components
is large.

7 Proofs 



Proof of Corollary 3.4. For the sake of simplicity, we assume that F is continuous on
\((F^{\leftarrow}(1 - \eta), \infty)\) for some \(\eta > 0\), but a slight refinement of the arguments given below shows that
the assertion holds without this continuity assumption. With

^{t;j) := / (l + 7x)-^/^dx 







and \(u_n := F^{\leftarrow}(1 - k_n/n)\), the estimation error can be decomposed as follows:
' " 'n„(t„,c„)- / l-F{s)ds 



a{kn/n) 



S{Qn) /^/ t„ + C -Q„(l) .^ N _^/ t. + C 



a{kn/n) V V S{Qn) ' V a{kn/n) 

S{Qn) ( , ( tn- A , /tn-^n . 

+ - (^^( a(fc„/n) '^^^"^J - 

V a(kn/n) J \a[kn/n) J V V a[kn/n) J \a(knl'n) 



+ / (1 + 7x)-^/^ - 7r(l - + ~<^nln)x)) dx 

J {tn-Un)/a{k„/n) 

\(=: I_a + I_b + II + III + IV.\)



It will turn out that, under the conditions of the corollary, the term III dominates all other terms. 
Because 

^^{t; 7) = r ^ log(l + 7x)(l + 7x)-i/^ - -(1 + ^x)-^^+^/^^ dx 
dl Jo 7 7 

(which is interpreted as \(\int_0^t x^2 e^{-x}\,dx/2\) for \(\gamma = 0\)), the mean value theorem shows that
III = {T{Qn) - 7) / ^ log(l + 7nx)(l + 7„x)-i/^" - ^(1 + 7„x)-(i+i/^") 

J (tn—Un)/a{kn/n) In In 

for some \(\gamma_n\) between \(T(Q_n)\) and \(\gamma\), which implies \(\gamma_n - \gamma = O_P(k_n^{-1/2})\).

First consider the case \(\gamma > 0\). Then \(1 + \gamma x \to \infty\) uniformly over the range of integration,
because



(n(l 


-F(y))A„)-^ 


- 1 




7 




(n(l 


-F(y))A„)-^ 


- 1 




7 




(n(l 


-F(y))/A;„)-^ 


- 1 


7 



d{kn/n) 7 \ n kr, 

(1 + 0(1)) (7.1) 

uniformly for \(y \in [t_n, t_n + c_n]\) by assumptions (3.16) and (3.17). It follows that \(x(1 + \gamma_n x)^{-(1 + 1/\gamma_n)} =
o\bigl(\log(1 + \gamma_n x)(1 + \gamma_n x)^{-1/\gamma_n}\bigr)\) and \(\log(1 + \gamma_n x) = O\bigl(\log(n(1 - F(t_n))/k_n)\bigr)\), and thus by condition
(3.17)



log(l + 7nx)(l + 7nx)-V^" 
log(l + 7x)(l + 7J;)-V7 

log (1 + %#) 



1 + ^^^^ ^ 

log(l + 7x) y V 1 + 7X 



= l + Op(A:-V2iog^_(l-F(t„)) 
= l + op(l) 

uniformly over the range of integration. Therefore, in the case \(\gamma > 0\), \(\gamma \ne 1\), direct calculations
yield

1 r{tn+Cn-u„)/d{kn/n) 

III = (T(Q„)-7)^/ log(l + 7x)(l + 7x)-V7dx(l + op(l)) 

T J{t 

n 

Un)/a(kn/n) 

1 



(r(Q„)- 7)^(1 + Op(l)) X 

T 



T 

1 + 7^^ter ) log (1 + 7^%e7^) - (1 + 7^) log (1 + 7^) 



7-1 



mQn)-7)-(l + Op(l)) X 

7 

,1-7 



- ^(*" + Cn)))'""log (1 - F{tn + Cn))) -[ISI- F{tn)))''''\og (1^(1 - F{tn + c))) 



1-7 

(r(g„)-7)-( 7:^(1 -F(tn))) ' log (-i(i - 



1 - A^-^ 



1-7 



(l + op(l)) 



where in the last but one step again (7.1) has been used. For \(\gamma = 1\), analogous arguments yield



III = (r(Q„)-7) 



it„+c„-u„)/a{k„/n) lQg(^l _|_ 2;) 



(i„-u„)/a(fc„/n) 



1 + X (1 + x)' 



dx(l + op(l)) 



{TiQn) - 7)^ (log' ((^(^ - + ^n))y\l + 0P(1)) 



n 



-1 



log^ hr (1 - (1 + 0P{1)) + O log 7-(l - F{tn)) 



n 



= (r(g„) - 7) log (^(1 - F{tn))) (log A + op(i)). 

In the case \(\gamma = 0\) we have \(\gamma_n x^2 = O_P\bigl(k_n^{-1/2} \log^2(n(1 - F(t_n))/k_n)\bigr) = O_P(1)\) uniformly over the
range of integration. A Taylor expansion of \(\log(1 + t)\) and of \(e^{-t}\) at \(t = 0\) shows that the integrand
of III equals



1 



7n 



1 



In 



In 



1 



In 



1 



^ log(l+7„x) exp ( - ^ log(l+7„x) ) - ^ exp ( - ( 1+^ ) log(l+7„x) ) = ^x^e-^+OpCfc-^/^^^g-x^ 






Hence, in view of (7.1), 



/// = (r(Q„)-7)2 



exp 



+ 0(^^_J^exp 



a{kn/n) 

tn ^71 



a{kn/n) 



(r(g„)-7); 



+ 0( log (f (1 - f (1 - F{tn))) +0(k-'/' logS (f (1 - f (1 - 



a{kn/n) 



tn Un 

a{kn/n) 

+ o(k-'/' 



tn ~\~ C-n 



Ur, 



a{kn/n) 



exp 



tn ~\~ C-n Un 

a{kn/n) 



tn-Un \^ 

a{kn/n) 



exp 



tn Un 

a{kn/n) 



log2 (^(1 - F{tn))) ^(1 - F{tn)) - log2 (^(1 - F{tn + Cn))) ^(1 " F{tn + C^)) 



n 



n 



n 



mQn) - 7) J log' (^(1 - Fitn))) ^(1 - F{tn)){l - A + Op(l)) 
Z \ / Kn 



by assumption (3.17).

To sum up, in all cases we have proved that 



/// 



n 



(i-i^(M) 



1-7 1 - A^-^ 

Tn— 

1-7 



(l + op(l)). 



In view of the asymptotic normality of \(T(Q_n)\), the assertion is obvious if we can show that the
other terms in the above decomposition of the estimation error are of smaller order.
To derive an upper bound for the term \(I_a\), note that by (3.7), (3.12) and (7.1)



tn Qn(l) tn Un 

S{Qn) a{kn/n) 



tn Un ( S(^Qr, 



(tn Un / 
d(k„/n) V 



d{kn/n) \d{kn/n) 



1 + 



(5„(1) - Un\d{kn/n) 



d{kn/n) J S{Qn) 



Opik- 



_^,^ {n{l-F{tn))/kn)-^ -I 



7 



Hence, again using (7.1) we obtain 

which is of smaller order than the term III. Likewise, it can be shown that \(I_b\) is asymptotically
negligible.

Next, check that by similar arguments 



// = Op{k-y^) 



{tn+Cn-Un)/a(kn/n) 



,/^ {n{l-F{tn))/kn)'-^ -I 

1-7 



(l+7x)-i/^dx+///J =Op(k~ 

which also is of smaller order than the term III.

It remains to be shown that the last term IV is asymptotically negligible, too. For y 
Un + d{un)x equation ( |7.1| ) reads as 



^{l-F{Un + diUn)x))) 



-7 



1 



+ o(/c-^/V„(l + 7x)) 



and thus 



n 



(1 - Finn + d{Un)x)) = (1 + 7x)-^/^(l + 0{k-'/^Tn)r'/^ = (1 + jx)-'/^ {l + o{k-'/^Tn)). 
We may conclude that



\IV\ < 



Un)/d{k„/n) 



{tn~Un)/a{kn/n) 



n 



(1 - F{un + d{K/n)x)) - (1 + 7x)-i/^ 



dx 



u„)/d{k„/n) 

which is asymptotically negligible compared with III.



□ 



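Proof of Corollary 4.1. By Theorem 3.1 (with \(a(t) = \gamma F^{\leftarrow}(1 - t)\)) and Skorohod's theorem
there exist versions of \(Q_n\) and \(W\) such that

\[ \sup_{0 < t \le 1} t^{\gamma + 1/2 + \varepsilon} \Bigl| k_n^{1/2} \Bigl( \frac{Q_n(t)}{F^{\leftarrow}(1 - k_n/n)} - t^{-\gamma} \Bigr) - \gamma t^{-(\gamma + 1)} W(t) \Bigr| \to 0 \quad \text{a.s.} \]
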

A Taylor expansion of the logarithm at 1 yields 

^"^^^-^ - logil+jk-'/H-'W{t) + o{k-'/h-('/'+^^))-lo^ 
^k-'/\t-'Wit) - Wil)) + o(A:-V2i-(i/2+.)) 



uniformly for t G [kn^^^^^''\\], because then kn^^'^t ^W{t) — t- uniformly by the law of the 
iterated logarithm for Brownian motions. Hence 



A/2+e 



k'J'hog^^ + T{Qr.)logt) - ^{t-'Wit) -W{1)) - [ s-^^+'^Wis),^TAds)-logt 



,l/2+.^l/2(r(g„)_^)_ / .-(7+1) 
^ 7(0,1] 





S ^' '-'Ty(s)l/T,7(ds)) logt + o(l) 



uniformly for t G \kn^^^^^'^\ 1] by (3.9). 

To deal with small values of t, recall the following well-known facts about the order statistics
\(U_{i:n}\) of iid rv's \(U_i\) that are uniformly distributed over (0, 1):

\[ \sup_{0 < t \le 1} \frac{t}{U_{[nt]+1:n}} = O_P(1), \qquad \sup_{1/(2n) \le t \le 1} \frac{U_{[nt]+1:n}}{t} = O_P(1) \qquad (7.2) \]

and \(n U_{k_n+1:n}/k_n \to 1\) in probability (see, e.g., Shorack and Wellner, 1986, (10.3.7) and (10.3.8)).
Because \((X_{n-i+1:n})_{1 \le i \le k_n+1}\) has the same distribution as \((F^{\leftarrow}(1 - U_{i:n}))_{1 \le i \le k_n+1}\), it follows by
assumption (3.13) that for

\[ R(t, x) := \frac{F^{\leftarrow}(1 - tx)}{F^{\leftarrow}(1 - t)} - x^{-\gamma} \qquad (7.3) \]

one has



log 



Qnjt) 
Qn(l) 



F^(l-fc„/n) 



F^{l-kn/n) 



log 



[fc„t]+l;n\ 



log 



-7 log 



knt 



+ log 1 + 



nUi 



k 

lk„t]+l:n 



log 1 + 



[knt\+l:n 
k 



kn 



V n ' kn 



+ log 1 + op fc„ 



_l/2/ ?^%.t] + l:« \-V2 



+ 7 log - log (1 + op 

Op(l) 



h 

tun 

nUk„+l:n 



-1/2 



uniformly for t G [(2fc„) , A; 



1 



, where in the last step we have used (7.2) which implies 



Op{l). 



Therefore, 



= Op(4/2-(V2+.)/(i+.))^^^(l) 
^ 

in probability uniformly for t E [(2/^^)"^, ^^'■^^^'']- Now assertion (4.2) is obvious from the law 
of iterated logarithm for Brownian motions and the fact that \(Q_n\) is constant on \((0, 1/k_n)\).
If h is bounded, the continuous mapping theorem yields

sup /,(t)(log|^ + r(Q„)logt 



0<i<l 



sup h{t)(-f{t-^W{t)-W{l))+ / s-'-^+^^W{s)iyT,^{ds) -log t). (7.4) 
0<t<l ^ 7(0,1] ' ^ 



Since \(Q_n\) is constant on intervals of the form \([(i - 1)/k_n, i/k_n)\), the continuity of h implies



max h 



i - 1/2 



[fc„to]<j<fc„ VA:„ + 1/2/ 



2,1/2 



-, ^n-i+l-.n , rrfr^ \^ ^ ^ 1^ 

^og— hr(Q„)iog 



X 



n—k„:n 



kn + 1/2 



sup h{t)kU^ 

to<t<l 



log|^ + r(QJlogi|+op(l) 



for all \(t_0 \in (0, 1]\). Finally, by the law of iterated logarithm combined with (7.4)

i - 1/2 



\k„ + 1/2) 



max h. 

i<j<[fc„to] Vfc„ + l/2 

< sup h{t)k]/'^ 
0<t<to 





2,1/2 



log 



^n- 

~x7 



-i+l:n 



n—kn'.n 



+ T{Qn) log 



i - 1/2 



kn + 1/2 



log|^ + r(Q„)logt|+op(l) 
in probability as \(t_0 \to 0\), so that assertion (4.3) follows. The last assertion is immediate from
\(\nu_{T_H,\gamma}(ds) = \gamma(s^{\gamma}\,ds - \varepsilon_1(ds))\) with \(\varepsilon_1\) denoting the Dirac measure in 1 (cf. Drees (1998b), Example
3.1). □

Proof of Corollary 4.2. Note that

\[ \frac{F^{\leftarrow}(1 - tx)}{F^{\leftarrow}(1 - t)} = x^{-\gamma} \exp\Bigl( \int_{tx}^{t} \frac{\eta(s)}{s}\,ds \Bigr) \qquad (7.5) \]



so that condition (3.13) reads as 
kl/"^ sup 

0<X<l+£ 



exp 



^ r]{skn/n) 



ds 



< k]!'^ sup 

0<a:<l+e 

^ 0. 



exp 



sup |^('S)| logx ) — 1 

0<s<(l+e)fcn/n 



In view of (4.5) this condition is fulfilled, and so convergence (4.2) holds. 



By the law of the iterated logarithm 

m 



t ^ ' t 

0{h{t){l - t)V2 logl/2 I log(l _t)\)+ 0{h{t){l - t)) 

a.s. 



as \(t \uparrow 1\). Hence, in view of (4.2), it suffices to prove that



sup h{t) \og^^^+T{Qn)\ogt 





1/2, 



(7.6) 



in probability for all sequences \(t_n \uparrow 1\). Because \(h(t) \log t \to 0\) as \(t \uparrow 1\) and \(k_n^{1/2}(T(Q_n) - \gamma) = O_P(1)\),
(7.6) would follow from



sup h{t) logf^^t^ 

t„<t<l-{2A:„)-i ^Wn{i-) 



a/2 



(7.7) 



To establish (7.7), one may argue similarly as in the second part of the proof of Corollary
4.1 using the uniform tail empirical quantile function, but it is easier to work with a Hungarian
construction for partial sums \(S_i := \sum_{j=1}^{i} \xi_j\) with \(\xi_j\), \(j \in \mathbb{N}\), denoting iid standard exponential
rv's. More concretely, for suitable versions of the \(\xi_j\) there exists a Brownian motion W such that
\(\max_{1 \le i \le k_n+1} |S_i - i - W(i)| = O(\log k_n)\) a.s. Moreover, the variational distance between the
distribution of \((X_{n-i+1:n})_{1 \le i \le k_n}\) and the distribution of \((F^{\leftarrow}(1 - S_i/n))_{1 \le i \le k_n}\) tends to 0 (Reiss,
1989, Theorem 5.4.3). Hence, to verify (7.7), it suffices to prove that



1.1/2 



sup 

t„<t<l-(2fc„)-i 



h{t) 



log ( 



< A^y^ sup h{t)-f 

t„<t<l-(2A:„)-i 

->■ 



F-(l-Sfc„+i/n) 



log 



tS, 



+ fey^ sup h{t) 

t„<t<l-{2fc„)-l 



\vis)\ 



S[knt]+i/n 



ds{7.8) 
in probability.

The Hungarian construction and the law of iterated logarithm imply 



log 



Si 



[knt] + l 



log 



[knt]+Wi[knt] + l) + Oilogkn) 
kn + W(kn + l)+0{l0gkn) 



log t - log IH + O 



i-t + o{{i-ty) + o 



log log A;„,\1/2n 



+ log(l + ^^^%±ll + 0^^°^'" 



krt 



Thus the second term in (7.8) can be bounded by 

kl^^ sup |r/(s)| • sup h{t)(l-t + 0{{l-tf) + 

0<s<{l+e)k„/n t„<t<l-{2k„)-^ ^ 

which tends to 0 because of (4.5) and \(h(t)(1 - t)^{1/2 - \varepsilon} \to 0\) as \(t \uparrow 1\).
Moreover,



log log k„ \ 1/2 



kri 



log 



5, 



[knt] + l 



log 1 



where 



W{[knt] + 1)-W{kn + 1) =^ kl/^w(l 



W{[knt] + 1) - Wjkn + l)t + Q(log fc„ 

A:„ + 0(A:;^'/' log^/^ bg A:„) 



o(e(l-^)''^ogV^|log(l 
0(A:y2(i_t)i/2iogi/2|iog(i_t)|) 



[knt] 
kn 



uniformly for \(t_n \le t \le 1 - (2k_n)^{-1}\). Hence, the first term in (7.8) is of the stochastic order

Z,l/2 



"n sup h{t) 

t„<t<l-(2fc„)-l 



W{[knt\ + 1) - W{kn + 1) + PF(A;„ + 1)(1 - t) 



Op[ sup 

■t„<t<l-(2A:„)-i 





+0(fc-i(l - t) log I log(l - t)| + fc-i log2 k^ 
h{t){l-tf'Hog^'^\\og{l-t)\ 



in probability, which completes the proof of (4.6). The second assertion follows exactly as in the
proof of Corollary 4.1. □
Figure 1: Hill estimator based on k + 1 largest order statistics versus k (above) and versus
log k / log n (below) for n = 1000 iid Fréchet rv's (left) and logarithmically disturbed Pareto rv's
(right) with 7 = 1/2. Estimated optimal numbers k^°^ (dashed), k^'^ (solid) and k^^ (dash- 
dotted) are indicated by vertical lines. 
Figure 2: ML estimator (solid line) and Hill estimator (dashed line) and the Hill estimator applied
to the data shifted by $300,000 (dash-dotted line) based on k + 1 largest hospital charges versus
k; the estimated optimal numbers for the Hill estimators are indicated by vertical lines

Figure 4: ML estimator (solid line) and Hill estimator (dashed line) based on k + 1 largest other
charges versus k

Figure 5: ML estimator (solid line) and Hill estimator (dashed line) for \(\eta\) based on k + 1 order
statistics of \(T_i^{(n)}\) versus k

Figure 6: Estimator of \(d(1/x, 1)\) (\(0 < x \le 1\)) and \(d(1, 1/(2 - x))\) (\(1 < x < 2\)) (solid line); pointwise
asymptotic 95%-confidence intervals are indicated by dashed lines; for comparison, the lines \(x \mapsto x\)
and \(x \mapsto 2 - x\) are shown by dotted lines